Editors

T. Längle
M. Heizmann

FORUM BILDVERARBEITUNG 2022
IMAGE PROCESSING FORUM 2022


Imprint

Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe


KIT Scientific Publishing is a registered trademark 
of Karlsruhe Institute of Technology. 
Reprint using the book cover is not allowed. 


www.ksp.kit.edu 


This document, excluding parts marked otherwise, the cover, pictures and graphs, is licensed under a Creative Commons Attribution-Share Alike 4.0 International License (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/deed.en

The cover page is licensed under a Creative Commons Attribution-No Derivatives 4.0 International License (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/deed.en


Print on Demand 2022 - Printed on FSC-certified paper


ISSN 2510-7224 
ISBN 978-3-7315-1237-0 
DOI 10.5445/KSP/1000150865 


Preface

Image processing is, by definition, the science of processing images. The field thus links the sensor technology of cameras (imaging sensors) with the processing of the acquired sensor data (the images). This link is what makes the discipline particularly appealing. Humans encounter images constantly, not least because vision is their most important source of information as a basis for action. The use of cameras opens up further opportunities, however, since image acquisition is not subject to the biological limitations of the eye: examples include multispectral and hyperspectral imaging, high frame rates, and the material measure embodied by the camera's pixel grid. Because cameras, like other electronic products, benefit from the high efficiency of electronics manufacturing, even high-quality cameras tend to become steadily cheaper.

In many tasks, humans have an intuitive idea of how a particular piece of information can be obtained from an image. Examples are the detection of defects on surfaces or orientation in space, for which humans combine different evaluation methods quite intuitively and without knowledge of any specific "algorithm". Transferring the processing of images to technical systems can still be challenging. While some tasks are considered largely solved, challenges remain that drive research toward new solutions and newly accessible fields of application. This reveals a further appeal of image processing: solution approaches often purposefully combine building blocks from numerous different disciplines, from imaging through systems theory and signal processing to information fusion and machine learning.

The aim of the "Forum Bildverarbeitung" (Image Processing Forum) is to make such interesting tasks and suitable solution approaches accessible to a broad audience and to stimulate professional exchange on the many facets of image processing. The event has taken place every second year since 2010 and is now organized jointly by the business unit Inspection and Optronic Systems of the Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB and the Institute of Industrial Information Technology IIIT of the Karlsruhe Institute of Technology KIT. This year again, a gratifyingly large number of authors responded to the call for contributions. After careful review, the program committee selected 24 high-quality contributions from the submissions and assigned them to the topic areas

• image acquisition,
• quality assurance,
• sorting,
• image processing,
• vehicles, and
• measurement and automation technology.

For the majority of the contributions, papers were prepared, which are included in these proceedings.

We thank the authors for their carefully prepared papers, the members of the program committee for actively approaching authors and for their valuable expertise in reviewing the submissions, and everyone whose attendance contributes to the success of the Image Processing Forum. For organizing the event and for technical support in preparing the proceedings, we thank Britta Ost, Felix Lehnerer, Jürgen Hock and Alexander Enderle.

Mehrwellenlängen-Verfahren zur strukturierten Beleuchtung

Multi-wavelength approach to structured illumination


Marcus Petz¹, Paul-Felix Hagen² and Rainer Tutsch¹

¹ Technische Universität Braunschweig, Institut für Produktionsmesstechnik, Schleinitzstraße 20, 38102 Braunschweig
² Institut für Mess- und Regelungstechnik, Leibniz Universität Hannover, An der Universität 1, 30823 Garbsen


Abstract Optical measuring methods based on the principle of structured illumination frequently apply phase-shift evaluation for optical spatial coding. While these approaches allow efficient and high-precision coding in many applications, they reach their limits in particular when different signal components are superimposed. Such signal superimpositions occur in fringe projection, for example, due to multiple reflections on the workpiece surface, or in deflectometric methods due to the superimposition of front- and rear-side reflections on transparent samples. Against this background, this article presents a new approach in the field of structured illumination which, based on comparable approaches from the field of absolute-measuring interferometry, implements spatial coding by a pattern sequence with increasing spatial frequency. In addition to the basics of the method, first experimental results are presented, which show that the method enables a high level of accuracy and that superimposed signals can be separated with high quality.

Keywords Structured illumination, optical spatial coding, multi-wavelength coding, fringe projection, deflectometry


1 Introduction


Optical measuring methods such as fringe projection, which is based on photogrammetric principles, or phase-measuring deflectometry require optical spatial coding. In fringe projection, this is achieved by projecting suitable patterns onto the workpiece surface; in deflectometry, by contrast, the patterns are usually displayed on a monitor that serves as the reference pattern plane.

The majority of the methods used for this optical coding are based on the phase-shift principle, in which a defined number of sinusoidal fringe patterns with uniform spatial frequency, but with phase positions shifted by defined angles, is recorded as a pattern sequence [1]. From this, a $2\pi$-periodic relative phase can first be calculated, which provides spatial information that is ambiguous within the measuring range.


Various approaches are in use for unwrapping the relative phase. Repeating the phase-shift measurement with, as a rule, three slightly different spatial frequencies and evaluating the resulting beat signals appears to be an advantageous approach [2]. In combination with the phase-shift approach most commonly used in this field of application, the symmetric 4-step algorithm, a complete pattern sequence for coding along one coordinate axis therefore consists of 12 images: three different spatial frequencies, each with the phase positions 0°, 90°, 180° and 270°. This coding approach is used as the reference in the following.
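
The 4-step evaluation used as the reference can be illustrated with a short sketch. It is not the authors' implementation; the function names and the intensity model $I_k = I_0 \cdot (1 + M \cdot \cos(\varphi + k \cdot 90°))$ are assumptions made for this example.

```python
import math


def four_step_phase(img0, img90, img180, img270):
    """Return the wrapped phase in [0, 2*pi) from four phase-shifted intensities.

    Assumed model: I_k = I0 * (1 + M * cos(phi + k * 90 deg)), which gives
    img270 - img90 = 2*I0*M*sin(phi) and img0 - img180 = 2*I0*M*cos(phi).
    """
    return math.atan2(img270 - img90, img0 - img180) % (2.0 * math.pi)


def simulate_pixel(phi, offset=100.0, modulation=0.5):
    """Hypothetical helper: intensities of one pixel for shifts 0, 90, 180, 270 deg."""
    return [offset * (1.0 + modulation * math.cos(phi + k * math.pi / 2.0))
            for k in range(4)]


# Offset and modulation cancel out; only the wrapped phase remains.
recovered = four_step_phase(*simulate_pixel(4.0))
```

The result is only known modulo $2\pi$, which is why the reference approach repeats the measurement at three spatial frequencies.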

The phase-shift methods outlined above are characterized by a comparatively short measurement time owing to the manageable number of patterns, by a computationally inexpensive and therefore fast algorithmic evaluation of the measurement data and, not least, by a high resolution and accuracy of the spatial coding. A property of this class of methods that is problematic in practical use, however, is that it is not robust against the superposition of different signal components [3]. If different spatial regions of the patterns are imaged onto the same location of the surface or of the detector, the spatial information is corrupted in a way that cannot be corrected.

Such a signal superposition occurs in fringe projection, for example, when the projected light is partially specularly reflected at the workpiece surface and subsequently strikes another region of the workpiece surface [4]. In deflectometry, a signal superposition occurs in particular when transparent objects such as optical lenses are to be measured in reflection [3]. As a rule, a superposition of front- and rear-side reflections is then observed in the camera image. To mitigate this problem, approaches have been described for both fringe projection and deflectometry in which the pattern is masked locally in such a way that signal superposition is avoided as far as possible [3,4]. However, this procedure always increases the measurement time significantly, since the measurement has to be repeated with a more or less large number of differently masked patterns. In addition, determining an optimal masking sequence adapted to the respective test object is time-consuming and non-trivial.

For these reasons, this contribution presents, for the first time, a different, novel approach to optical spatial coding by means of sinusoidal fringe patterns. It is inspired by multi-wavelength methods as used in absolute-measuring interferometry, in which, for example, a tunable laser source can be used to encode distance along the beam axis [5]. The basic principle is that, when tuning through suitable wavelengths, an oscillation frequency of the interference signal can be detected that depends linearly on the distance to the radiation source. In the following, this basic principle is transferred to single-axis spatial coding by means of a pattern sequence with varied spatial frequency. The fundamentals of the coding method and of the algorithmic evaluation are presented, together with first experimental results. These show that the novel coding method achieves a resolution and accuracy comparable to the heterodyne phase-shift method described above and, furthermore, that superimposed signals can be separated with high quality.


2 Fundamentals of the multi-wavelength approach


In interferometry, the phase of a periodic signal is determined by the wavelength $\lambda$, the path length $L$ and the refractive index $n$ according to Equation 1 [6].

$$\varphi = \frac{2\pi}{\lambda/2} \cdot n \cdot L \tag{1}$$

For a constant wavelength $\lambda$ and a constant refractive index $n$, a characteristic phase $\varphi$ thus results for each path length $L$. Depending on the phase, Equation 2 gives the signal intensity $I$, which additionally depends on the interference contrast $\gamma$ [6].

$$I = I_0 \cdot \left(1 + \gamma \cdot \cos\left(\frac{2\pi}{\lambda/2} \cdot n \cdot L\right)\right) \tag{2}$$


If the wavelength $\lambda$ is now varied while the path length $L$ is kept constant, a phase change results over the period under consideration. If the wavelength $\lambda$ is tuned in such a way that $1/\lambda$ changes linearly, the resulting phase change according to Equation 1 is linear. A linear phase change is equivalent to a harmonic oscillation whose frequency depends on the path length. The frequency and phase of the resulting harmonic oscillation can be measured and, given a suitable choice of the wavelengths $\lambda_i$, are a linear measure of the path length $L$.

If this method from interferometry is transferred to spatial signals, the factor 1/2 in the denominator of Equation 1 is dropped first, because in the direct detection of spatial frequencies, unlike in interferometry, no double path length in the form of an outward and return path has to be taken into account. Instead of the path length $L$, the position $X$ along the coding direction, i.e. the spatial coordinate within the pattern, is considered in the following. For the spatial signals considered here, the refractive index $n$ is also dropped. Accordingly, Equation 1 simplifies to Equation 3 for the case considered here.

$$\varphi(X) = \frac{2\pi}{\lambda} \cdot X \tag{3}$$


The interference contrast $\gamma$ in Equation 2 is replaced by the modulation $M$. This yields Equation 4 for the observable intensity $I(X)$.

$$I(X) = I_0 \cdot \left(1 + M \cdot \cos\left(\frac{2\pi}{\lambda} \cdot X\right)\right) \tag{4}$$

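
As an illustration of Equations 3 and 4, the following sketch generates a wavelength sequence whose spatial frequencies $1/\lambda_i$ are equally spaced and evaluates the intensity at a given position. The function names and default values are assumptions for this example; the parameters 20 px, 36 px and N = 32 are the illustration values of Figure 1.

```python
import math


def wavelength_sequence(lam_min, lam_max, n):
    """Return n wavelengths (in px) whose spatial frequencies 1/lambda are
    equally spaced, ordered from lam_max to lam_min (increasing frequency)."""
    f_low, f_high = 1.0 / lam_max, 1.0 / lam_min
    step = (f_high - f_low) / (n - 1)
    return [1.0 / (f_low + i * step) for i in range(n)]


def intensity(x, lam, i0=1.0, modulation=1.0):
    """Equation 4: I(X) = I0 * (1 + M * cos(2*pi/lambda * X))."""
    return i0 * (1.0 + modulation * math.cos(2.0 * math.pi * x / lam))


# Signal seen over the pattern index i at one fixed position X = 90 px.
wavelengths = wavelength_sequence(20.0, 36.0, 32)
signal_at_90 = [intensity(90.0, lam) for lam in wavelengths]
```

Over the sequence, the phase at X = 90 px advances by $90 \cdot (1/20 - 1/36) = 2$ full periods, while at X = 0 every pattern has the same intensity.
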
Figure 1 illustrates a corresponding wavelength sequence, with the help of which the further discussion can be followed. For the visualization, the limits of the wavelength spectrum were chosen as $\lambda_{\min} = 20$ px and $\lambda_{\max} = 36$ px, and the number $N$ of discrete wavelengths as $N = 32$. The wavelengths $\lambda_i$ between $\lambda_{\min}$ and $\lambda_{\max}$ are graded, as required above, such that the spatial frequency $1/\lambda_i$ changes linearly. Figure 1 shows the resulting pattern sequence in the interval $X \in [0\,\text{px}; 800\,\text{px}]$. First, it can be seen that for $X = 0$ all signals have the same initial phase. The harmonic oscillation resulting from the sequence at the location $X = 0$ therefore has the frequency $f = 0$, which cannot be evaluated. Since the frequency increases linearly with the coordinate $X$, the Nyquist frequency $f_{\text{Nyquist}}$, the highest frequency that can still be captured correctly, is eventually reached. The frequency $f$ as a function of the registered number of periods $P$ of the harmonic oscillation and the number $N$ of different pattern wavelengths is given by Equation 5.

Figure 1: Exemplary wavelength sequence for spatial coding using the multi-wavelength approach (horizontal axis: image position X / px; vertical axis: pattern index i; marked frequencies: f = 0, f_min ≈ 0.0645, f_max ≈ 0.258, f_Nyquist = 0.5; the measuring range lies between f_min and f_max).


$$f = \frac{P}{N - 1} \tag{5}$$

For the pattern considered as an example in Figure 1, rearranging Equation 5 for the case of the Nyquist frequency $f_{\text{Nyquist}} = 0.5$ gives the number of periods registered between the first and the last sample as $P = 0.5 \cdot 31 = 15.5$. Starting from Equation 3, the phase difference $\Delta\varphi$ between $\varphi_{\lambda_{\min}}$ and $\varphi_{\lambda_{\max}}$ is furthermore generally given by Equation 6.


$$\Delta\varphi = \varphi_{\lambda_{\min}} - \varphi_{\lambda_{\max}} = \left(\frac{2\pi}{\lambda_{\min}} - \frac{2\pi}{\lambda_{\max}}\right) \cdot X \tag{6}$$

With the condition $\Delta\varphi = 15.5 \cdot 2\pi$ that applies here for the Nyquist frequency, the location $X$ at which the Nyquist frequency is reached for the above pattern sequence is found to be $X = 697.5$ px, as marked in Figure 1. Beyond this coordinate, the frequencies decrease linearly again, while the phase is rotated by 180°. Within the theoretical frequency range from $f = 0$ to $f = 0.5$, it makes sense to impose further requirements on the recorded oscillation signals, which further restrict the usable spatial range. In the interest of the most reliable frequency and phase measurement possible, it is sensible, for example, to require a minimum number of recorded periods $P$. The requirement $P_{\min} = 2$ gives a coordinate of $X_{\min} = 90$ px for the sequence considered. To maintain a sufficient distance from the Nyquist frequency at the other end of the measuring range, it is also expedient to require a minimum number of samples per signal period. With $S$ as the number of samples per signal period, Equation 5 can be rewritten as Equation 7.


P 
With the expedient requirement $S_{\min} = 4$, this gives $P_{\max} = 8$ and consequently $X_{\max} = 360$ px. In the present case, the effectively usable measuring range would therefore be $\Delta X = X_{\max} - X_{\min} = 270$ px. The parameters used here for illustration are thus not suitable for practical use of the method. By contrast, the parameters chosen for the experimental investigations presented below, $\lambda_{\min} = 20$ px, $\lambda_{\max} = 21$ px, $N = 48$, $P_{\min} = 2$ and $S_{\min} = 4$, result in a usable coding range of $\Delta X = X_{\max} - X_{\min} = 5040\,\text{px} - 840\,\text{px} = 4200\,\text{px}$. This means that even a 4K monitor can be spatially coded unambiguously or, alternatively, as shown in Section 4.2, two Full HD monitors can be coded with non-overlapping frequency ranges.
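
The range limits quoted above can be reproduced with a few lines of arithmetic. The sketch below derives the position for a given number of periods from Equation 6. Because Equation 7 is not legible in the source, the relation `p_max = n / s_min` is an assumption, chosen only because it matches the stated values $P_{\max} = 8$ and $X_{\max} = 5040$ px; all function names are hypothetical.

```python
def position_for_periods(periods, lam_min, lam_max):
    """Coordinate X (px) at which the sequence spans the given number of
    periods, from Equation 6 with delta_phi = periods * 2*pi."""
    return periods / (1.0 / lam_min - 1.0 / lam_max)


def coding_range(lam_min, lam_max, n, p_min, s_min):
    """Return (x_nyquist, x_min, x_max) in px for one parameter set."""
    # Nyquist limit: f = 0.5 in Equation 5, i.e. P = 0.5 * (N - 1).
    x_nyquist = position_for_periods(0.5 * (n - 1), lam_min, lam_max)
    x_min = position_for_periods(p_min, lam_min, lam_max)
    # Assumption (Equation 7 is missing): P_max = N / S_min.
    x_max = position_for_periods(n / s_min, lam_min, lam_max)
    return x_nyquist, x_min, x_max


illustration = coding_range(20.0, 36.0, 32, p_min=2, s_min=4)
experiment = coding_range(20.0, 21.0, 48, p_min=2, s_min=4)
```

Narrowing the wavelength band from 20 to 36 px down to 20 to 21 px stretches every limit by the same factor, which is what enlarges the usable range from 270 px to 4200 px.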


3 Data evaluation


The computational evaluation of the recorded image sequences essentially consists of determining the frequency and phase of the harmonic oscillation recorded for each pixel. In the following, two cases that have already been investigated experimentally in more detail are distinguished: first, the measurement without superposition of different spatial information, which requires determining the parameters of only one harmonic oscillation, and second, the superposition of two pieces of spatial information, for which the parameters of two different harmonic oscillations have to be determined.


3.1 Evaluation of one frequency


For each measuring point, $N$ discrete intensity values $y_i$ are obtained from the recorded image sequence, which in their temporal order represent a harmonic oscillation. Because of its better signal-to-noise ratio compared with the frequency, the phase angle of the signal serves as the measure of the spatial coordinate $X$ of interest; the frequency is nevertheless also needed in order to unwrap the periodic relative phase. Owing to the rather small number of sampling points $N$, a Fourier transform, the obvious choice for solving the problem, turns out to offer only a very low frequency resolution that is ultimately insufficient even with interpolation. It is therefore necessary to fit a sine function to the measured data iteratively. To achieve good convergence, however, this requires sufficiently good starting values for the free parameters of the objective function. These can be determined with sufficient quality by means of a Fourier transform, so that the evaluation process essentially consists of a Fourier transform followed by an iterative sine fit.
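
The two-stage idea (coarse frequency from a Fourier transform, then iterative refinement) can be sketched for a single pixel as follows. This is a simplified illustration and not the authors' C++ implementation: it uses a plain DFT instead of an FFT, and it swaps in a golden-section search over the frequency, with offset, amplitude and phase solved by linear least squares at each trial frequency. All names are hypothetical.

```python
import cmath
import math


def coarse_frequency(y):
    """Frequency (cycles per sample) of the strongest non-DC DFT bin."""
    n, mean = len(y), sum(y) / len(y)

    def magnitude(k):
        return abs(sum((v - mean) * cmath.exp(-2j * math.pi * k * i / n)
                       for i, v in enumerate(y)))
    return max(range(1, n // 2), key=magnitude) / n


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, 3):
            factor = m[r][c] / m[c][c]
            m[r] = [u - factor * v for u, v in zip(m[r], m[c])]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def fit_at_frequency(y, f):
    """Least squares for y_i ~ c + a*cos(w*i) + b*sin(w*i), w = 2*pi*f.
    Returns (sum of squared residuals, c, a, b)."""
    w = 2.0 * math.pi * f
    basis = [(1.0, math.cos(w * i), math.sin(w * i)) for i in range(len(y))]
    ata = [[sum(r[j] * r[k] for r in basis) for k in range(3)] for j in range(3)]
    aty = [sum(r[j] * v for r, v in zip(basis, y)) for j in range(3)]
    c, a, b = solve3(ata, aty)
    sse = sum((v - (c + a * r[1] + b * r[2])) ** 2 for r, v in zip(basis, y))
    return sse, c, a, b


def fit_sine(y, iterations=60):
    """Return (frequency, phase, amplitude, offset) of the fitted sinusoid
    y_i ~ offset + amplitude * cos(2*pi*frequency*i + phase)."""
    n, f0 = len(y), coarse_frequency(y)
    lo, hi = f0 - 0.5 / n, f0 + 0.5 / n      # half a DFT bin on either side
    g = (math.sqrt(5.0) - 1.0) / 2.0          # golden-section ratio
    for _ in range(iterations):
        f1, f2 = hi - g * (hi - lo), lo + g * (hi - lo)
        if fit_at_frequency(y, f1)[0] < fit_at_frequency(y, f2)[0]:
            hi = f2
        else:
            lo = f1
    f = 0.5 * (lo + hi)
    _, c, a, b = fit_at_frequency(y, f)
    return f, math.atan2(-b, a) % (2.0 * math.pi), math.hypot(a, b), c
```

The search interval of half a DFT bin on either side assumes that the coarse estimate is the bin nearest to the true frequency, which holds for a clean single sinusoid but is not guaranteed for noisy data.
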
Obviously, the computational effort for the presented multi-wavelength method is therefore significantly higher than for the established phase-shift method. In a first implementation, the two essential parts of the evaluation procedure outlined above, the fast Fourier transform (FFT) and the sine fit, were realized in a C++ dynamic link library, making use of the option of parallelizing subtasks. The evaluation times achieved so far are around 5 seconds per one million measuring points, measured by way of example on an AMD Ryzen™ 7 5800H processor. Even in this early trial phase, the evaluation time is thus already of an order of magnitude suitable for practical use. A further substantial reduction of the evaluation time could be achieved with manageable effort by using GPU computing, especially for the FFT.


3.2 Evaluation of two frequencies


If two pieces of spatial information are superimposed at one location of the image sequence, the parameters of two harmonic oscillations have to be calculated instead of those of only one. Under favorable boundary conditions, that is, in particular if the superimposed frequencies are sufficiently far apart and have a similarly high modulation, the evaluation procedure described above can in principle be retained. In this case, the two local maxima with the largest amplitudes are extracted from the FFT spectrum and used directly as starting values for the optimization problem. In the Gauss-Newton optimization, an individual modulation, frequency and phase are assigned to each of the two oscillations, while the offset can only be calculated as a sum for both oscillations.
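
The extraction of start values for two superimposed oscillations can be sketched as a search for the two strongest local maxima of the magnitude spectrum. The sketch uses a plain DFT for brevity and covers only the start-value step, not the subsequent Gauss-Newton optimization; the function names are hypothetical.

```python
import cmath
import math


def dft_magnitudes(y):
    """Magnitudes of the DFT bins 0 .. n//2 of the mean-free signal."""
    n, mean = len(y), sum(y) / len(y)
    return [abs(sum((v - mean) * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(y)))
            for k in range(n // 2 + 1)]


def two_strongest_peaks(y):
    """Frequencies (cycles per sample) of the two largest local maxima of the
    magnitude spectrum, sorted by frequency."""
    n, mag = len(y), dft_magnitudes(y)
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    peaks.sort(key=lambda k: mag[k], reverse=True)
    return sorted(k / n for k in peaks[:2])


# Two superimposed oscillations with a shared offset, N = 48 samples.
signal = [100.0
          + 40.0 * math.cos(2.0 * math.pi * 0.11 * i + 0.3)
          + 30.0 * math.cos(2.0 * math.pi * 0.30 * i + 1.1)
          for i in range(48)]
start_frequencies = two_strongest_peaks(signal)
```

The start values are limited to the bin spacing 1/N, so they only need to be close enough for the subsequent fit to converge.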

Experience from various test measurements shows that for more demanding scenarios, i.e. in particular a small frequency separation and/or clearly different modulation of the two signals, the probability of this direct approach failing increases markedly. On the one hand, the FFT then increasingly often yields starting values that deviate too strongly; on the other hand, the Gauss-Newton method shows increasingly problematic convergence behavior. The interplay of FFT and sine fit can nevertheless be retained in principle, except that it is advantageously subdivided into several sub-steps and a downhill simplex method is used as the final optimization step.


4 Measurement results


The measurements carried out so far using the presented approach essentially pursue two goals. First, it is to be investigated whether the method in principle enables a coding quality comparable to that of the established phase-shift methods. Second, it is to be verified whether, as a first step, two superimposed pieces of spatial information, as typically occur in the deflectometric measurement of lenses, can be separated with a quality that is in principle comparable to that of a measurement without superposition.


4.1 Measurement without signal superposition


The measurement setup for the scenario without signal superposition consists of an electronic camera of type IDS UI5240SE-M with a FUJINON HF9HA-1B lens, which directly observes the complete screen area of a Samsung PLS monitor of type S24E650 with a resolution of 1920 × 1200 px. The heterodyne phase-shift technique according to [2] with wavelengths of $\lambda_1 = 20$ px, $\lambda_2 = 21.5$ px and $\lambda_3 = 23$ px is used as the reference coding approach. For the multi-wavelength approach, the parameter set $\lambda_{\min} = 20$ px, $\lambda_{\max} = 21$ px, $N = 48$ and $P_{\min} = 5$ was chosen. The complete phase-shift sequence thus consists of 12 images in total, while the multi-wavelength sequence consists of 48 images. In both cases, each image is obtained by adding four 12-bit images from the camera, which produces sets of synthetic 14-bit images.
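
To see why three closely spaced wavelengths can suffice for an unambiguous reference coding, the beat wavelengths can be computed. The sketch assumes the common two-stage scheme in which the two first-order beats are combined into one second-order beat; whether [2] uses exactly this hierarchy is not stated in the text, and the function name is hypothetical.

```python
def beat_wavelength(lam_a, lam_b):
    """Wavelength (px) of the beat signal between two fringe patterns."""
    return lam_a * lam_b / abs(lam_b - lam_a)


lam1, lam2, lam3 = 20.0, 21.5, 23.0   # reference wavelengths in px

beat_12 = beat_wavelength(lam1, lam2)          # first-order beat
beat_23 = beat_wavelength(lam2, lam3)          # first-order beat
beat_total = beat_wavelength(beat_12, beat_23) # second-order beat
```

Under this assumption the second-order beat wavelength is about 2198 px, which exceeds the monitor width of 1920 px.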

The relative position error obtained is used as the measure of accuracy. It is defined here as the ratio of the phase deviations $\Delta\varphi$ to the span of the phase values $\varphi_{\max} - \varphi_{\min}$ in each measurement. Since the real measurements contain not only high-frequency, noise-like deviations but also low-frequency deviations resulting from the setup geometry and optical distortions, the relevant high-frequency component is determined by high-pass filtering the phase data. Both measurements yield approx. 654,000 measuring points and a relative position error that is normally distributed to a very good approximation. The standard deviation is approx. $7.79 \cdot 10^{-6}$ for the multi-wavelength approach, while the phase-shift method yields a value of approx. $8.54 \cdot 10^{-6}$. It should be noted, however, that the number of images is larger by a factor of 4 for the multi-wavelength approach. Nevertheless, it can be stated that the multi-wavelength method, despite the somewhat higher measurement effort, enables spatial coding that is competitive with established phase-shift methods.


4.2 Measurement with signal superposition


For the investigations into the separability of two superimposed signal components, a setup consisting of two monitors of the above-mentioned type arranged at an angle of 90° to each other was used in conjunction with a camera of type IDS UI3070CP-M and a FUJINON HF16HA-1B lens. A beam splitter cube is positioned directly in front of the lens such that the camera observes a superposition of the two displays as soon as both are activated. Two scenarios were investigated with this setup: first, the case in which the two monitors display a pattern sequence that leads to non-overlapping frequency ranges of the two monitors; second, the more demanding scenario in which both monitors are coded using the same frequency band.

As before, the parameter set used for the first scenario is $\lambda_{\min} = 20$ px, $\lambda_{\max} = 21$ px, $N = 48$. With $P_{\min,1} = 2.4$ for monitor 1, observed in transmission through the beam splitter, and $P_{\min,2} = 7.4$ for monitor 2, observed in reflection, the pattern provides two non-overlapping frequency ranges, the first starting at $X_{\min,1} = 1176$ px and the second at $X_{\min,2} = 3108$ px.

To assess the quality of the signal separation, additional measurements were carried out with only one monitor activated at a time, so that reference data without signal superposition are available. The differences between the computationally separated phase data from the measurement with superposition and the corresponding individual measurements show excellent agreement of the phase information. The relative phase error, to be understood as $\Delta\varphi_{\mathrm{rel}} = \Delta\varphi / 2\pi$, has a standard deviation of $\sigma_{\Delta\varphi_{\mathrm{rel}},1} \approx 1/1107$ for monitor 1 and of $\sigma_{\Delta\varphi_{\mathrm{rel}},2} \approx 1/460$ for monitor 2, which appears considerably darker. These values roughly match those that would be expected, given the noise floor of the method, when subtracting two individual measurements without signal superposition.

If both monitors are coded with the same frequency range, the superimposed signal components can be separated comparably well in most of the image field, as described above, because the beam splitter makes monitor 2 appear horizontally mirrored relative to monitor 1. Only in a narrow region in which the frequencies of the two signal components lie very close together does the current calculation method fail to separate the signals. In the measurement setup described, the signal offset currently required for successful signal separation corresponds to about 100 pixels in the monitor plane. It can be assumed that optimizing the calculation method will further improve the selectivity.


5 Summary


The presented multi-wavelength method is a novel approach to structured illumination that can be used in measuring methods such as fringe projection and deflectometry. The main feature of the approach is that, in contrast to the established phase-shift techniques, it can cope with the superposition of several pieces of spatial information. In addition to the fundamentals of the approach, data evaluation methods for single and double spatial information per pixel are presented. Experimental data demonstrating the potential of the approach are available for both cases. It can be stated that the multi-wavelength approach enables optical spatial coding with uncertainties comparable to those of established phase-shift techniques. For the case of signal superposition, the given example shows that two superimposed data sets can be separated effectively and correctly, provided that the pieces of spatial information to be separated are not too similar. The multi-wavelength approach thus shows high potential for special applications in fringe projection and deflectometry that cannot be handled with the established phase-shift techniques.

Abstract In this work, we propose a physics-enhanced two-to-one Y-neural network (two inputs and one output) for phase retrieval of complex wavefronts from two diffraction patterns. The learnable parameters of the Y-net are optimized by minimizing a hybrid loss function, which evaluates the root-mean-square error and the normalized Pearson correlation coefficient on the two diffraction planes. An angular spectrum method network is designed for self-supervised training of the Y-net. Amplitudes and phases of wavefronts diffracted by a USAF-1951 resolution target, a phase grating of 200 lp/mm, and a skeletal muscle cell were retrieved using a Y-net with 100 learning iterations. Fast reconstructions could be realized without constraints or a priori knowledge of the samples.


Keywords Coherent diffraction imaging, phase retrieval, deep 
neural network 


1 Introduction 


Retrieving the phase from diffraction patterns is a long-standing problem. In the recorded intensity patterns, the object wavefront is superimposed with its phase conjugate, and to reconstruct the wavefront without the conjugate, the phase needs to be retrieved. Conventional methods use constraints to solve the phase retrieval problem iteratively. A priori knowledge of the object plane [1] or modulations applied on the imaging path [2] [3] can serve as the constraints. Optimization iterations are needed.

Deep learning is a powerful approach for solving optimization prob- 
lems. A convolutional neural network (CNN) is trained with a dataset 
for mapping input to output. CNNs are widely used in image process- 
ing, they have an end-to-end structure, which can be trained to retrieve 
a phase pattern from an intensity pattern [4] [5]. After training on a 
dataset, the reconstruction can be directly made by a CNN without fur- 
ther optimization. The phase retrieval problem has an explicit physical 
model and a CNN can be enhanced with the diffraction principle [5] 
in order to avoid training with thousands of patterns. However, the 
end-to-end structure of a CNN described in [6] limits the object to be 
phase-only. Splicing the phase and amplitude into one image seems to
be a straightforward solution, but a CNN uses a convolution kernel for
feature extraction. The adjoining edges of the amplitude and phase
patterns may be convolved with one kernel and generate data that
contradict the physical model.

In this work, we propose a physics-enhanced neural network for re- 
trieving a complex wavefront from two axially displaced diffraction 
patterns. A two-to-one Y-net (two inputs and one output) is designed 
to retrieve the phase on the first plane. Then the complex wavefront is 
calculated with the retrieved phase and the square root of the recorded 
intensity pattern. An angular spectrum method (ASM) network is de- 
signed to calculate the wave propagation. The Y-net is trained with the 
diffraction between the two recording planes and produces a phase on 
the first plane, which can be used to generate two patterns on the two 
recording planes. The errors between generated and recorded patterns 
are evaluated with a hybrid loss function. The normalized Pearson 
correlation coefficient and root mean square error are used to build the 
hybrid loss function. The learnable parameters in the Y-net are opti- 
mized by gradient descent on the hybrid loss function. After training 
on a dataset, the Y-net can be generalized to retrieve complex wave- 
fronts without optimization. Reconstruction can also be made using 
an untrained Y-net. An amplitude-only USAF-1951 resolution chart,
a phase grating, and a skeletal muscle cell are experimentally
reconstructed using an untrained Y-net.


2 Y-net for retrieving the complex wavefront


A schematic of the setup used for recording axially displaced diffraction
patterns is shown in Fig. 1(a). The sample is illuminated by a plane
wave and the diffraction patterns are recorded on two planes at
distances of $z'$ and $z' + z$. To reconstruct the complex-valued object,
the phase on the first recording plane is retrieved using a Y-net.

The proposed Y-net is a fusion of two U-nets. There are two down- 
sampling paths and one up-sampling path, which are composed of four 
down-sampling and corresponding up-sampling convolution blocks. 
In each convolution block, the information passes through two sets of
layers, each consisting of a batch normalization layer, a rectified
linear unit (ReLU) layer, and a convolution layer. The feature maps in
each down-sampling block are extracted using a 3×3 convolution kernel
with a stride of 2. In the bottleneck of the Y-net, the feature maps from
the two down-sampling paths are combined to form the input of the
up-sampling path. The up-sampling is then performed with transposed
convolutions.
There are residual layers and skip connections after the convolution 
blocks to make the deep Y-net easy to optimize by avoiding the vanish- 
ing gradients problem and mitigating the degradation problem. 
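
As a rough illustration of the down-sampling paths, the following sketch computes the feature-map size after each of the four stride-2 convolution blocks. The padding of 1 is an assumption (the text only specifies the 3×3 kernel and the stride of 2), and the 512-pixel input corresponds to the pattern size used in the experiments.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula).

    padding=1 is an assumption; the text only gives kernel and stride.
    """
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(input_size=512, blocks=4):
    """Feature-map side lengths along one down-sampling path."""
    sizes = [input_size]
    for _ in range(blocks):
        sizes.append(conv_output_size(sizes[-1]))
    return sizes


print(encoder_sizes())  # [512, 256, 128, 64, 32]
```

Under these assumptions each block halves the side length, so the bottleneck works on 32×32 feature maps for each of the two inputs.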

The schematic for training the Y-net is shown in Fig. 1(b). In the first
training loop, the learnable parameters are randomly initialized. This
initialization helps to keep the signal from expanding to an extremely
high value or vanishing to zero. Then the learnable parameters are
optimized by minimizing a hybrid loss function, which is built by fol- 
lowing the optical diffraction model. 

The hybrid loss function for the Y-net is a linear combination of
the loss functions on the two diffraction patterns. The output of the
Y-net is set to be the phase on the first diffraction pattern. The complex
wavefronts on the two planes follow Rayleigh-Sommerfeld diffraction.
By merging the phase $\varphi(x_1, y_1)$ with the first recorded intensity
$I_1(x_1, y_1)$, we obtain the wavefront on the first plane,
$u_1(x_1, y_1) = \sqrt{I_1(x_1, y_1)} \exp[i\varphi(x_1, y_1)]$. After
propagating $u_1(x_1, y_1)$ to the second plane, we obtain the wavefront
$u_2(x_2, y_2) = \mathrm{prop}_z\{u_1(x_1, y_1)\}$, where $z$ is the distance
between the two planes. For evaluating the differences between
$|u_2(x_2, y_2)|^2$ and $I_2(x_2, y_2)$, a loss function is built from
the linear combination of the root-mean-square error (RMSE) and the
normalized Pearson correlation coefficient (PCC),

$$\mathrm{Loss}\{I, I'\} = l_{PCC}\,\mathrm{PCC}\{I, I'\} + l_{RMSE}\,\mathrm{RMSE}\{I, I'\} \qquad (1)$$

$$\mathrm{PCC}\{I, I'\} = \frac{1}{2}\left\{1 - \frac{\sum_{m=1}^{M}\sum_{n=1}^{N}(I_{mn} - I_{ave})(I'_{mn} - I'_{ave})}{\sqrt{\sum_{m=1}^{M}\sum_{n=1}^{N}(I_{mn} - I_{ave})^2}\,\sqrt{\sum_{m=1}^{M}\sum_{n=1}^{N}(I'_{mn} - I'_{ave})^2}}\right\} \qquad (2)$$

$$\mathrm{RMSE}\{I, I'\} = \sqrt{\frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}(I_{mn} - I'_{mn})^2} \qquad (3)$$

where $l_{PCC}$ and $l_{RMSE}$ are the relative weights of the normalized PCC
and the RMSE, $m$ and $n$ are integer pixel indices, $M$ and $N$ are the numbers
of pixels in the patterns, and $I_{ave}$ and $I'_{ave}$ are the average pixel
values of the images. The PCC measures the linear similarity between the two
patterns, which is evaluated by the ratio between the covariance of the
pixel values and the product of their standard deviations. The PCC has
a value between -1 and 1, where 1 represents two similar patterns. To
perform gradient descent, the PCC operator is normalized as shown
in Eq. 2. The normalized PCC has a value between 0 and 1, where 0
represents a high similarity. The RMSE is used together with the PCC to
obtain a better convergence in a training loop. The RMSE compares
every pixel value of the generated intensity with the captured ground
truth. The scaling effect of the PCC can be reduced by using the RMSE
evaluation. When the RMSE value is 0, the generated intensity and the
captured ground truth are the same on every pixel of the image.

Figure 1: (a) Recording two patterns diffracted by a complex object; (b) Training the
Y-net based on diffraction between the two planes; (c) Retrieving the phase on
the first pattern.

In order to apply a sufficient constraint to the neural network, the
loss function is also built on the second plane. The amplitude of
the propagated wave $u_2(x_2, y_2)$ is replaced by $\sqrt{I_2(x_2, y_2)}$. Then
the updated wavefront $u_2'(x_2, y_2)$ is propagated to the first plane,
$u_1'(x_1, y_1) = \mathrm{prop}_{-z}\{u_2'(x_2, y_2)\}$. The differences between
$|u_1'(x_1, y_1)|^2$ and $I_1(x_1, y_1)$ are evaluated. The hybrid loss function
for training the Y-net is $d_1\,\mathrm{Loss}_1\{I_1\} + d_2\,\mathrm{Loss}_2\{I_2\}$,
where $d_1$ and $d_2$ are the weights of
the loss on the two diffraction planes. Training the neural network is a
process of optimizing the weights of each layer to minimize the predic- 
tion error between the outputs and ground truth. This is usually made 
by using gradient descent methods on the loss functions. In this work, 
the ADAM optimization is used for minimizing the hybrid loss func- 
tion on the two planes. A well-trained Y-net retrieves a phase following 
the diffraction principles between the two planes. 
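
The per-plane loss described above can be sketched in plain Python as follows. This is only an illustration, not the authors' implementation: the normalization (1 - PCC)/2 is an assumption chosen to match the stated range of 0 to 1 with 0 for high similarity, the function names and default weights are hypothetical, and images are represented as flat lists of pixel values.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pcc(a, b):
    """Pearson correlation coefficient of two equally sized pixel lists.

    Undefined (division by zero) for a constant image.
    """
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def normalized_pcc(a, b):
    # Assumed normalization: maps PCC = 1 to 0 and PCC = -1 to 1.
    return 0.5 * (1.0 - pcc(a, b))


def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def hybrid_loss(generated, recorded, w_pcc=1.0, w_rmse=1.0):
    """Weighted sum of normalized PCC and RMSE (weights are hypothetical)."""
    return w_pcc * normalized_pcc(generated, recorded) + w_rmse * rmse(generated, recorded)


recorded = [0.1, 0.4, 0.9, 0.3, 0.7, 0.2]
scaled = [2.0 * v for v in recorded]  # same pattern, wrong scale

print(hybrid_loss(recorded, recorded))  # identical patterns: loss close to 0
print(normalized_pcc(recorded, scaled), rmse(recorded, scaled))
```

The last line illustrates why both terms are combined: a uniformly scaled pattern is perfectly correlated with the recorded one, so the PCC term alone cannot detect the scaling error, whereas the RMSE term can.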


3 Reconstructions in experiments 


Experimental results were obtained by using an amplitude-only USAF-1951
resolution test target, a phase grating, and a skeletal muscle cell sample.
The diffraction patterns were recorded using the setup shown in Fig.
2(a). The samples were illuminated with a plane wave with a wavelength
of 655 nm. The pixel size of the camera is 2 µm. After capturing
the first diffraction pattern, the camera was shifted to capture the
second. The distance between the two planes was 400 µm. The size of
the diffraction patterns was 512×512 pixels, which is a compromise between
good resolution and fast training. Better results could be obtained
using more pixels, but a longer training time would then
be necessary.
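
To make the propagation between the two recording planes concrete, here is a minimal one-dimensional sketch of the angular spectrum method using the wavelength, pixel size and plane separation quoted above. It is an illustration only: the real data are two-dimensional, a naive DFT stands in for an FFT so that no external library is needed, and the test field is made up.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stands in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * math.pi * idx * k / n)
                    for idx, v in enumerate(values))
        result.append(total / n if inverse else total)
    return result


def asm_propagate(field, wavelength, pitch, distance):
    """Propagate a sampled 1-D complex field by `distance` (same length unit)."""
    n = len(field)
    spectrum = dft(field)
    propagated = []
    for k, amplitude in enumerate(spectrum):
        # Spatial frequency of DFT bin k (negative frequencies in the upper half).
        fx = (k if k <= (n - 1) // 2 else k - n) / (n * pitch)
        arg = 1.0 / wavelength ** 2 - fx ** 2
        if arg >= 0.0:
            transfer = cmath.exp(2j * math.pi * distance * math.sqrt(arg))
        else:
            transfer = 0.0  # evanescent components are discarded
        propagated.append(amplitude * transfer)
    return dft(propagated, inverse=True)


# Values from the text, in micrometres; the test field itself is made up.
WAVELENGTH, PITCH, Z = 0.655, 2.0, 400.0
u1 = [cmath.exp(1j * 0.3 * math.sin(2 * math.pi * i / 16)) for i in range(16)]
u2 = asm_propagate(u1, WAVELENGTH, PITCH, Z)         # first plane -> second plane
u1_back = asm_propagate(u2, WAVELENGTH, PITCH, -Z)   # second plane -> first plane
```

With a 2 µm pixel pitch all sampled spatial frequencies lie below 1/λ, so the transfer function is a pure phase factor and propagating forward and then backward recovers the original field.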

Figure 2: Experimental reconstruction for an amplitude-only USAF-1951 resolution
target. (a) Schematic of the experimental setup; (b) and (c) The recorded
diffraction patterns on the two planes; (d) Retrieved phase on the plane of (b);
(e) and (f) Amplitude and phase of the reconstruction using the Y-net with 100
iterations; (h) and (i) Amplitude and phase of the reconstruction by propagating
the first diffraction pattern.

Figs. 2(b) and (c) show the recorded patterns of the USAF-1951
resolution target. The first pattern was recorded at a distance of 4.4 mm
from the object. An untrained Y-net was used for reconstruction.
Self-supervised learning was performed by optimizing the hybrid loss
function with 100 iterations. With the ASM network, the Y-net learns to
retrieve a phase following the diffraction between the two recording 
planes. As shown in Fig. 2(d), features of the object can be distinguished
in the retrieved phase. The complex wavefront on the
first plane is calculated by multiplying the retrieved phase and the 
recorded amplitude. The phase and amplitude components are then 
reconstructed after propagating the calculated wavefront to the object 
plane. The intensity and the phase of the reconstruction are shown in 
Figs. 2(e) and (f). The sixth element of group five in the USAF-1951
target was resolved (line width of 8.77 µm). The reconstruction of the
complex wavefront was made using the untrained Y-net without a priori
knowledge. Figs. 2(h) and (i) show the reconstruction of intensity
and phase obtained by simply propagating the first diffraction pattern 
to the object plane. The intensity is not correctly reconstructed due to 
the presence of the conjugated wavefront. 
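
The quoted line width can be checked against the standard USAF-1951 relation, in which the resolution of an element is 2^(group + (element - 1)/6) line pairs per millimetre. The short sketch below (function names are illustrative) also converts the 200 lp/mm of the phase grating mentioned in the abstract into its period.

```python
def usaf_line_pairs_per_mm(group, element):
    """Resolution of a USAF-1951 element in line pairs per millimetre."""
    return 2.0 ** (group + (element - 1) / 6.0)


def line_width_um(line_pairs_per_mm):
    """Width of a single line (half a period) in micrometres."""
    return 1000.0 / (2.0 * line_pairs_per_mm)


def period_um(line_pairs_per_mm):
    """Full period (one line pair) in micrometres."""
    return 1000.0 / line_pairs_per_mm


print(round(line_width_um(usaf_line_pairs_per_mm(5, 6)), 2))  # 8.77
print(period_um(200))  # 5.0
```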

A phase grating was also investigated with the same experimental 
setup shown in Fig. 2(a). The phase grating has a period of 5 µm (200
lp/mm). The first pattern was captured at a distance of 4.6 mm from
the phase grating. Then the camera was shifted by 400 µm for recording
the second pattern. After self-supervised learning (100 iterations), the
phase distribution of the grating was reconstructed (see Fig. 3(b)).
In this experiment, the phase grating cannot be reconstructed using
simple propagation of the recorded diffraction pattern (Figs. 3(c), (d)).

A skeletal muscle cell was used in another experiment to further
demonstrate the capability of the Y-net. In this case the sample was
illuminated with a plane wave with a wavelength of 632.8 nm. The pixel
size of the camera was 5.86 µm. The first diffraction pattern was
captured 39.4 mm away from the specimen. This distance was determined
numerically by back-propagating the retrieved wave from the recording
plane to the object plane. Then the camera was shifted by 1 mm for
capturing the second pattern. The phase at the first plane was retrieved
after training the Y-net with 100 iterations. The reconstruction of the 
sample is obtained by propagating the retrieved wavefront. The recon- 
structed amplitude and phase of the skeletal muscle cell are shown in 
Figs. 3(e) and (f). The amplitude and phase show different structures 
of the skeletal muscle.


Figure 3: Experimental reconstruction for the phase grating and skeletal muscle cell
sample. (a), (e) and (b), (f) Amplitude and phase of the reconstruction using the
Y-net with 100 iterations; (c), (g) and (d), (h) Amplitude and phase of the
reconstruction by propagating the first diffraction pattern.


4 Conclusion 


A Y-net is proposed to efficiently reconstruct complex wavefronts. With
self-supervised training through an ASM network, the Y-net learns the
diffraction between the two planes. Only two diffraction patterns are
needed for the reconstruction. The two patterns may also be captured
simultaneously using two cameras and one beam splitter. A well-trained
Y-net may then realize quasi-real-time phase retrieval. The Y-net can be
trained on a large dataset for the best generalization. The Y-net
has promising potential for the investigation of physical processes that
vary in both time and space. Large-scale complex wavefronts
can be rapidly retrieved using a well-trained Y-net. Besides optical
diffraction, this two-to-one Y-net may also be applied to learning other
physical principles, such as the transmission of sound waves.


Abstract We present a newly developed method for snapshot
multispectral imaging. The core idea is to use a diffractive optical
element (DOE) in an intermediate image plane. The main
advantages are the potentially cost-effective implementation for
different applications, e.g. for classification, and the possibility
to use different spatio-spectral samplings at different field positions.
By appropriate choice of the DOE it is possible to choose the
spectral and spatial sampling pattern. We also briefly address
the issue of light efficiency for different approaches towards
multispectral imaging.


Keywords Hyperspectral imaging, multispectral imaging, 
diffractive optics 


1 Introduction 


Spatially resolved spectral information can be fruitfully employed in
many applications ranging from food monitoring to the detection of air
pollution. Most often, image sensors with so-called Bayer patterns are 
used which mimic the human visual system with three broad spectral 
channels, typically denoted as the “short” (blue), “middle” (green) and 
“long” (red) bands. 

For some applications other or more spectral bands are advanta- 
geous. But one has to keep in mind that there is always a trade-off 
between spectral resolution, number of spectral channels, spatial res- 
olution, light efficiency and measurement time. If high spectral and 
spatial resolution is desired, typically, the amount of light per spatio- 
spectral pixel element is low (for a given entrance pupil and luminance 
of the scene to be imaged). We can distinguish between “hyper-”
and “multi-”spectral imaging based on the number of spectral channels.
Typically, “hyper” is used for many channels (e.g. more than
10). In the following we use the term “multi” if more than one channel
is used. Therefore, even a standard RGB sensor is defined to be a
“multi-spectral” sensor.

In general one can distinguish between snapshot and scanning sen- 
sors. Typically, for a small number of channels snapshot sensors are 
possible whereas for a large number of channels scanning approaches 
are applied (most often line-by-line imaging, so-called “push-broom 
imaging”). In this contribution we focus on snapshot imaging. An excellent
overview and review is given by Hagen et al. in [1]. In the
following we will only mention the main methods without going into
detail about all possible sub-variants. Fig. 1 shows the basic sensing
principles that are employed.

Most often, mosaics of absorption-based filters are used (as in the 
traditional Bayer pattern). This approach has a lot of advantages and 
is very cheap in mass production. 

If more and narrower channels are desired, interference-based filters
are employed [2]. Of course, the usable light per spatio-spectral
sampling element is proportional to the spectral bandwidth and inversely
proportional to the number of spectral channels if there is no spectral
overlap (compare Section 3). Therefore, for most applications one has
to find a trade-off between spatio-spectral sampling and signal-to-noise
ratio.

However, the disadvantages of mosaics of dielectric filters have
to be kept in mind. Homogeneous manufacturing of areas of such
mosaics is complicated and expensive, and the spectral response of a
filter depends on the angle of incidence of the light (and on neighbouring
pixels). Therefore, image-sided telecentricity is advantageous. In any case,
a thorough calibration of the sensor, ideally for every pixel, is necessary
if sensing with accurate spectral resolution is desired [2].

A variation of the standard mosaic approach is to use image replication.
In this case an individual filter is employed for each of the replicated
images. Filter manufacturing becomes easier but image replication
has to be introduced. Most easily this can be realized macroscopically
by just using several cameras side-by-side, each one equipped with one
individual filter. However, for three-dimensional scenes there is a
parallax between the individual images that should be corrected
by image post-processing or otherwise leads to errors. The parallax
error is proportional to the separation of the entrance pupils of the
individual image channels. Therefore, miniaturization is advantageous,
leading e.g. to approaches like the one described by Hubold et al. [3].

Figure 1: Basic sensing principles for snapshot multi-spectral imaging.

A classic alternative is to use the same entrance pupil and to split the
image by dichroic beam splitters. This has been used a lot in commercial
RGB color cameras because the light efficiency can be improved
by that approach, of course at the cost of needing multiple image
sensors and their alignment.

If extensive post-processing is possible, so-called “computed tomography
imaging spectroscopy” (CTIS) is an interesting option [4].
The light is diffracted by a computer-generated hologram into multiple
diffraction orders that lead to separated copies of the scene on the image
sensor. Each copy again consists of copies, one for each wavelength.
On the sensor one obtains an overlap of all these wavelength-separated
copies, and in the post-processing one tries to reconstruct the
original multi-spectral information. Fast implementation is possible
using neural networks [5].

The integral field approach uses spatial sampling with a pinhole ar- 
ray in combination with an imaging system with strong (lateral) chro- 
matic aberration to obtain a spectrum for each of the sample points. 

Now, if we open more pinholes the naive (and robust) sampling and 
dispersion approach will fail and we — again — will have overlap 
of information on the image sensor and some kind of reconstruction 
to obtain the spatially resolved spectral information is needed. Such 
approaches are typically denoted as “compressed sensing”. 

Figs. 2 and 3 show a qualitative comparison of the different sensing
principles with respect to the key parameters of a snapshot hyperspectral
sensor.


2 Diffraction-based multispectral sensor 


One of the main disadvantages of mosaic-based multispectral imaging 
is the costly and difficult manufacturing of the mosaic filter. For high 
volume applications, of course, this is not an issue and such filters can 
be cheaply manufactured. But if specialized areas are to be realized, 
the initial development cost would be huge. 

In Fig. 4 we show an alternative solution that uses diffraction in- 
stead of interference or absorption-based filters. It becomes possible 
to realize arbitrary spatio-spectral patterns by manufacturing a corre- 
sponding diffractive optical element. Such manufacturing is possible 
at rather manageable cost by several companies and universities. 

The DOE is located in an intermediate image plane and deflects the
light depending on the wavelength. To understand the working
principle it is helpful to first assume image-sided telecentricity of
the first imaging stage and one large grating with constant grating
period as the DOE. Due to the telecentricity, the chief rays will arrive at
the same angle on the DOE and will be deflected according to their
wavelengths.

Figure 2: Qualitative comparison of different snapshot multi-spectral imaging
approaches (large amplitudes are advantageous). Part 1

Different wavelengths then will hit the filter plane (actually the 
Fourier plane of the second imaging) at different locations and we can, 
therefore, let a certain spectrum pass the filter by using an appropriate 
iris. The rest of the second imaging system refocuses the light onto the 
monochrome (or color, if we want to combine with absorption-based 
filtering) image sensor. 

By that approach we could make a sensor having a certain spectral 
response but only one channel. 

However, we can now replace the simple grating with a more complicated
diffracting structure. For example, we can use different micro-gratings
with different grating periods at different spatial locations in the
intermediate image plane. We choose the periods of the gratings such
that a certain wavelength will be deflected in the appropriate way so
that it will pass the iris. Each “pixel” in the intermediate image will
then consist of a micro-grating, and the grating period determines which
wavelength will pass the iris.

Figure 3: Qualitative comparison of different snapshot multi-spectral imaging
approaches (large amplitudes are advantageous). Part 2

Arbitrary spatio-spectral patterns can be realized by this kind of 
“grating-mosaic” and one can even realize complex spectra at one 
point by replacing a micro-grating with a more complex “computer- 
generated hologram”. 
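
How the grating period selects the transmitted wavelength can be sketched with the first-order grating equation, sin θ = λ/d. The sketch assumes normal incidence of the chief rays on the DOE (in line with the telecentricity assumption) and a hypothetical iris direction; the numerical values are illustrative and not taken from the paper.

```python
import math


def first_order_angle(wavelength_um, period_um):
    """Deflection angle (rad) of the first diffraction order at normal incidence."""
    ratio = wavelength_um / period_um
    if ratio > 1.0:
        raise ValueError("first order is evanescent: wavelength exceeds period")
    return math.asin(ratio)


def period_for_wavelength(wavelength_um, iris_angle_rad):
    """Grating period that deflects `wavelength_um` into the iris direction."""
    return wavelength_um / math.sin(iris_angle_rad)


# Hypothetical design: the iris selects the direction into which a 4 um
# grating deflects 550 nm light.
iris_angle = first_order_angle(0.550, 4.0)
periods = {wl: period_for_wavelength(wl, iris_angle) for wl in (0.450, 0.550, 0.650)}
for wl, d in periods.items():
    print(f"{wl * 1000:.0f} nm -> period {d:.2f} um")
```

In this picture each micro-grating "pixel" is assigned the period that steers its target wavelength through the iris, so shorter target wavelengths need proportionally smaller periods.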

Figure 4: Principle of the diffraction-based multi-spectral sensor.

Unfortunately, the spectral resolution is strongly coupled with the
spatial resolution because the filter acts as a spectral filter and as the
aperture stop of the imaging at the same time. If we use a small hole as the
iris, the spectral resolution $\Delta\lambda$ improves but the spatial
resolution $\Delta r$ according to Rayleigh deteriorates. In [6] we derived
the following uncertainty relation, which strongly depends on the minimal
grating period $d$ that can be manufactured ($\Delta r$ is given in the
intermediate image plane):


$$\Delta\lambda \cdot \Delta r \geq \lambda \cdot d \qquad (1)$$


For a given minimum critical dimension $d$ of the DOE manufacturing
and a given size of the image (and intermediate image) we will obtain
a certain maximum number of resolution cells with a certain spectral
bandwidth. This corresponds to the information that can be captured.
With an intermediate image size of $w \times h$ the information is given by


$$Q = \frac{w h}{\Delta r^2} \cdot \frac{\Lambda}{\Delta\lambda} \leq \frac{w h \, \Lambda \, \Delta\lambda}{\lambda^2 d^2} \qquad (2)$$


if the whole usable spectral range is denoted by $\Lambda$.

For an intermediate image of 20 mm × 20 mm, a minimum grating
period of $d = 2$ µm, a spectral bandwidth of 300 nm and a spectral
resolution of 50 nm we obtain $Q \approx 5 \cdot 10^6$.

The fewer spectral channels we use, the larger the overall information
that can be captured. 
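
The numerical example can be reproduced with a few lines of code. The sketch assumes the bound $Q \leq w h \Lambda \Delta\lambda / (\lambda^2 d^2)$, which follows from inserting the uncertainty relation $\Delta\lambda \cdot \Delta r \geq \lambda \cdot d$ into $Q = (w h / \Delta r^2)(\Lambda / \Delta\lambda)$. The wavelength of 550 nm is an assumption, since the text does not state which wavelength was used; all lengths are in micrometres.

```python
def information_capacity(width_um, height_um, total_range_um,
                         resolution_um, wavelength_um, min_period_um):
    """Upper bound w*h*Lambda*dlambda / (lambda^2 * d^2) on the number of
    spatio-spectral resolution cells."""
    area = width_um * height_um
    return (area * total_range_um * resolution_um
            / (wavelength_um ** 2 * min_period_um ** 2))


# 20 mm x 20 mm image, 300 nm range, 50 nm resolution, d = 2 um;
# the 550 nm wavelength is an assumed value.
q = information_capacity(20_000, 20_000, 0.300, 0.050, 0.550, 2.0)
print(f"{q:.2e}")  # 4.96e+06

# Halving the spectral channel width (more channels) halves the bound.
q_fine = information_capacity(20_000, 20_000, 0.300, 0.025, 0.550, 2.0)
print(q_fine / q)
```

Under these assumptions the result is close to the value of about 5·10^6 quoted in the text, and the second call shows that the bound scales linearly with the spectral channel width.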

In Fig. 5 we show an example of a measurement with a 7-channel
sensor where the DOE consists of stripes. Such an arrangement is
especially useful for detecting small shifts in wavelength. In the
example shown, a spectral shift of 0.5 nm can be measured.

Figure 5: Signal on the image sensor during narrow-band illumination of a USAF
target. The USAF target was illuminated in transmission with a central
wavelength of 632 nm. The half-width of the illumination spectrum was 10 nm.


3 Light efficiency 


Apart from the quite obvious parameters of spatial and spectral resolution,
the light efficiency is also very important. Good light efficiency
allows one to use larger F-numbers or shorter exposure times at the
same signal-to-noise ratio.

We want to compare the different multispectral snapshot technolo- 
gies according to the light efficiency. The baseline is a monochrome 
image sensor without any spectral channels. 

The conventional absorption- or diffraction-based mosaic filter will
reduce the light efficiency by a factor


$$\eta = \frac{\Lambda}{\Delta\lambda} \qquad (3)$$


Beware that this is not the same as the number of spectral channels.
Most of the time it is advantageous to obtain good light efficiency
by using strongly overlapping filters. This is rarely done for commercial
sensors but is the standard in biology (compare e.g. the spectral
responses of cones in the human eye). For classification purposes it is
indeed often useful to employ overlapping channels, and even simple
processing can be used to classify based on spectral information.

In the human visual system, for example, differences between the red and the
green channel are “computed”. The difference signal varies strongly
with the spectrum of the input light if it lies in the overlapping region.
Therefore, humans are extraordinarily good at discerning green-yellowish
colors. Obviously, object classification performance as well
as light efficiency for ordinary scenes is quite good.

The integral field imaging approach uses an amplitude mask in the 
intermediate image plane [7]. There is no spectral loss of light but, of 
course, the mask spatially filters and thereby eliminates a lot of pho- 
tons. The separation of the individual pinholes should be at least N 
times larger than the diameter of the pinhole if we want to have N 
separated spectral channels. The associated loss is


$$\eta = N \qquad (4)$$


But again we could allow some kind of spectral overlap. 

CTIS avoids filtering altogether. In principle, all incoming photons
(we neglect practical issues like the diffraction efficiency of the
employed hologram) will arrive at the image sensor. However, it is
not clear how to make a fair comparison with the filter-based approaches.
Due to the overlapping of information a reconstruction step is necessary,
and at this stage noise might be amplified and artifacts might be
introduced. Therefore, the really useful light efficiency is not 100%
($\eta = 1$). The overall noise is also increased because readout noise,
quantization noise and fixed-pattern noise contributions will increase
due to the effectively increased number of pixels that are exposed.

Compressed sensing in a snapshot approach lies somewhere between
integral field imaging and CTIS. Again, information overlap will occur
and, therefore, reconstruction is necessary, however with less information
overlap compared to CTIS and less light reduction compared to
integral field imaging.

The newly proposed diffraction-based approach loses the light at
the central iris. The associated loss is again the same as
with the integral field imaging approach (Eq. 4) if the individual channels
are to be spectrally separated. However, the approach might be worse
because the F-number of the first imaging stage is in turn coupled with
the spectral resolution: a large ray bundle would lead to bad spatial
resolution. If one wants to achieve more spatial information with the
same spectral behavior, one has to increase the size of the intermediate
image, and as a result the whole setup becomes larger.

The conclusion is that all the approaches lead to more or less the same
effective loss of light. The higher the chosen spectral resolution (the
smaller the spectral half-width of the channels), the more loss is introduced.

One should carefully consider whether high spectral resolution is
necessary at all, because overlapping channels are beneficial for many
applications.


4 Conclusion 


The proposed sensor has the same light efficiency as other well-known
snapshot multispectral sensing principles. The main advantage
is that one can freely choose the spatial and spectral resolution at each
position of the scene and that even more complex spectral responses
can be easily realized using standard diffractive optics manufacturing.

However, spectral and spatial resolution are coupled by an uncer- 
tainty relation and also the F-number is coupled to the spectral resolu- 
tion. In addition, an intermediate image is necessary. In practice this 
leads to increased space requirements for the sensor. 

We thank the German ministry for education and research (BMBF)
for financial support under the grant 13N15165 and Simon Amann for
fruitful discussion on CTIS.


Abstract Imaging through turbid media leads to a great loss of
information, decreasing the image quality. In this work we try
to mitigate this problem by adding an absorbent to the medium,
eliminating part of the scattered radiation responsible for the
turbidity. This research work is preceded by the demonstration of
the effectiveness of black carbon powder as an absorbent, leading
to improved image quality [1,2]. Here, we use graphene
nanoplates as an absorbent and compare the results with black
carbon powder in order to study the possible improvement.


Keywords Vision, absorption, scattering, turbid media 


1 Introduction 


When a medium is interposed between an object and the detection system,
there is a loss of quality of the transmitted image due to the behavior
of light in the medium. The transparency of a system
affects how light behaves when passing through it. For instance,
translucent materials, such as diffusive media, allow light to pass through
them, but the light is altered. Some photons pass through the body and
reach the detector without alterations (ballistic photons), some fail to
pass through it and are retained in the medium (absorbed photons),
and others suffer changes in their trajectory (scattered photons), which
prevents clear vision since they arrive at the detector in a random manner.
Adding an absorber to the medium can improve the image quality
since it has more chances to absorb the scattered photons, which travel
longer paths than the ballistic ones; hence part of the scattered radiation
will be eliminated before reaching the detector.

Consequently, numerous studies have been carried out on these media
for several years, giving rise to certain mathematically complex
techniques such as stellar interferometry, inverse scattering, or
fluorescence, among many others, that try to solve this problem.
Regarding fluorescence, it is worth mentioning a test performed at the end
of the 20th century, in which a technique combining fluorescence and
absorption to improve the image quality of an object hidden by a diffuse
medium was tested. It was shown that the image quality could
be further improved by absorption, by selecting the spectral range of the
fluorescence light that is highly absorbed by the medium [3]. In addition,
at the end of the 20th century, a new technique, simpler to perform,
was introduced to improve vision through a highly diffusive random medium
by using the absorption present in the medium.
It was proven that absorption reduces the intensity of the scattered light,
which generates the image noise, below the intensity of the ballistic
signal, which forms the image. This improvement in the signal-to-noise ratio
makes it possible to see through a diffusive medium that would be opaque
without the presence of absorption [4]. Another test at that time showed
that when the absorption method is used to improve image quality in
turbid media, the received energy decreases, but so does the path length of
the photons arriving at the detector, meaning that scattered photons, which
have longer trajectories, are absorbed more than ballistic photons. In
addition, it showed that the results obtained were similar to those achieved
with the time-gating technique, which is more complex and expensive.
This method is the most widely used for breast imaging. This last trial
was performed to consider a new technique in medicine to detect
breast tumors [5]. Gradually, the applications of this methodology have
grown, reaching the military industry [6] and astronomy [7].

The basic methodology we have used is similar to that developed in [1].
In that work, the improvement of image quality is studied using black
carbon powder as an absorber in two different scattering media, one
consisting of zinc oxide nanoparticles, and the other of polystyrene
nanoparticles. This technique is the one of interest in the present
investigation and the one on which the study has been based. For this
purpose, we have made a series of samples and tested them in the
laboratory for subsequent analysis in Matlab using the SSIM function. In
2018 another paper was published following the line of research of the
2016 work [2]. In that article the influence of the wavelength of the incident
light on the image enhancement is studied. Because the angular
distribution of the scattering depends on the size of the scatterers with
respect to the wavelength of the incident light, the authors propose a new
approach to image enhancement, selecting the appropriate wavelength
range [2]. The most recent paper we have found is from 2019, in which
the authors analyze the absorption-scattering coupling and its impact on
haze in random media. They also introduce the haze-absorption
sensitivity spectrum, which quantifies the capacity of absorption-induced
haze suppression [8].

It is also worth mentioning other interesting papers about absorp- 
tion, scattering and turbid media, using other techniques and ap- 
proaches [9-19]. 

Taking into account the aforementioned investigations, the aim of 
this paper is to compare the image enhancement achieved by graphene 
and black carbon powder as absorbers, and to study the influence of 
incident light by performing the experiments using white and red light. 


2 Theory 


To understand the absorption phenomenon, light must be understood
as corpuscular and quantized, with discrete values of energy. Absorption
occurs when an electron is excited by a photon. Electrons occupy
orbitals separated from each other by discrete amounts of energy, in
which the number of electrons is limited by the Pauli exclusion
principle. When excited, the electron moves to a higher energy level,
absorbing the energy and leaving a hole in its original position.

Imaging through absorbers leads to a loss of brightness, since not all
ballistic photons reach the detector. In the context of imaging,
scattering occurs, for instance, when a particulate system such as a
turbid medium is interposed between the object and the detection system.
The rays emitted by the object are obstructed by the particles in the
medium, which deflect their path.

Scattering depends on the particle size. We can distinguish two models
within the context of our work: the Rayleigh and the Mie regimes.
When the particle size is much smaller than the wavelength of the
incident light, we are in the Rayleigh regime, while if the particle
diameter is of the order of or larger than the wavelength of the
incident light, we are in the Mie regime.
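
As a rough illustration of this distinction, the following Python sketch classifies a scatterer by the dimensionless size parameter x = pi * d / lambda, a standard quantity in scattering theory that is not used explicitly in this paper. The threshold of 0.1 for the Rayleigh regime is only a common rule of thumb and is an assumption here, as are the particle diameters in the usage example; the function names are hypothetical.

```python
import math

def size_parameter(diameter_nm: float, wavelength_nm: float) -> float:
    """Dimensionless size parameter x = pi * d / lambda."""
    return math.pi * diameter_nm / wavelength_nm

def scattering_regime(diameter_nm: float, wavelength_nm: float,
                      rayleigh_limit: float = 0.1) -> str:
    """Label the regime; rayleigh_limit is an assumed rule-of-thumb threshold."""
    x = size_parameter(diameter_nm, wavelength_nm)
    return "Rayleigh" if x < rayleigh_limit else "Mie"

# Hypothetical diameters; 650 nm corresponds to the red bandpass filter.
for d in (10.0, 500.0):
    print(d, round(size_parameter(d, 650.0), 3), scattering_regime(d, 650.0))
```

Between the two limits the transition is gradual, so the label near the threshold should not be over-interpreted.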

Imaging through scatterers degrades the quality of the transmitted
image, since the scatterers deflect the photons, which then arrive at
the detector in a disordered way, resulting in blurred images. Therefore,
in turbid media we can distinguish three types of photons: the ballistic photons
that form the image, arriving in an orderly manner at the detector; the 
scattered photons, which generate blurred images because their trajec- 
tory has been altered and they arrive randomly at the camera; and the 
absorbed photons, which do not reach the detector, causing a loss of 
intensity. 


3 Methodology 


The procedure we followed to perform the experiments is shown in
Figure 2.

It consisted, first of all, of the preparation of the samples, for
which we used graphene and black carbon powder as absorbers (each
separately), polystyrene nanospheres as diffusers and distilled water
as the matrix medium. We tested four different solutions, gradually
increasing the amount of diffuser, with 30, 50, 70 and 100 µl in 10 ml
of distilled water, and absorber amounts of 0.3, 0.4, 0.5, 0.6, 0.7,
0.8, 1.3 and 3.3 mg for the first three solutions and 0.5 and 1.3 mg
for the last one.

Afterwards, we placed the samples in the cuvette of the optical
system for imaging. The imaging setup consisted of a CMOS camera, a
1951 US Air Force resolution target as the object, a biconvex
converging lens, and a rectangular glass cuvette holding the samples
between the camera and the object. As the radiation source, we used an
incoherent white LED, both with and without a 650 nm bandpass filter.

Figure 1: Imaging setup.

Once the images have been taken, to compare and evaluate their
quality we must first select those that are comparable with each other.
For this we use the signal-to-noise ratio, following three different
methods depending on the application. To calculate the signal-to-noise
ratio of the images we use the following formula:

$$\mathrm{SNR} = \frac{\langle I \rangle}{\sqrt{\langle I^2 \rangle - \langle I \rangle^2}}$$

where $I$ is the image intensity and $\langle I \rangle$ is the average image intensity.
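
The signal-to-noise formula can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' Matlab code; the function name `image_snr` and the synthetic pixel data are assumptions, and a flat list of intensities is used so that the example runs without third-party packages.

```python
import math
import random
from statistics import fmean

def image_snr(pixels):
    """SNR = <I> / sqrt(<I^2> - <I>^2) for a flat sequence of pixel intensities."""
    values = [float(p) for p in pixels]
    mean = fmean(values)
    variance = fmean(v * v for v in values) - mean * mean
    std = math.sqrt(max(variance, 0.0))  # clamp rounding noise below zero
    return math.inf if std == 0.0 else mean / std

# Synthetic example (assumed data): level 100 with Gaussian noise of sigma 5.
random.seed(0)
noisy = [random.gauss(100.0, 5.0) for _ in range(256 * 256)]
print(image_snr(noisy))  # close to 100 / 5 = 20
```

A perfectly uniform image has zero standard deviation, which the sketch reports as an infinite ratio instead of raising an error.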

Here, we present the methods we have used to choose the images
based on the application. The results vary depending on the method
selected.

Method 1: We select the images considering the signal-to-noise ratio 
of the reference image. This criterion is useful for those applications 
where the reference image is known, for example, in the geostationary 
satellite case. 

Method 2: Considering the exposure time of the reference image, we
select the disturbed one and, depending on its signal-to-noise ratio, we
choose the images with an absorber. This criterion is useful for images
whose degree of damage is such that it is not possible to return to the
reference one, so that it is necessary to work on the disturbed
image. An example is optical elements in space that have deteriorated
so much that their initial condition cannot be restored, which makes it
necessary to work with the deteriorated image.

Method 3: Considering the exposure time of the reference image,
we select the same disturbed one. Then, to select the images with an
absorber, we vary the exposure time, seeking to return to the
signal-to-noise ratio conditions of the reference. This criterion is
useful for applications like those of method 1, but when the disturbance
appears instantaneously.

Once we have selected the images, we evaluate them using the
structural similarity index (SSIM) in Matlab.


[Figure 2 (flow diagram): real object with matrix medium (reference image); add diffuser (disturbed image); add absorber (adapted image); image selection and comparison (SNR) by method 1, 2 or 3; comparable image packages; objective image quality and resolution evaluation (SSIM).]

Figure 2: Procedure diagram.


SSIM quantifies the similarity of an image to the reference one,
taking into consideration the structure, contrast, and luminance
of the images $(x, y)$, as we can see below [20,21].

1) Luminance comparison:

$$l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$$

where $C_1 = (0.01 \cdot (2^{\mathrm{bitsperpixel}} - 1))^2$.

2) Contrast comparison:

$$c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$$

where $C_2 = (0.03 \cdot (2^{\mathrm{bitsperpixel}} - 1))^2$.

3) Structure comparison:

$$s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$$

where $C_3 = C_2/2$.

Here, $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$ and $\sigma_{xy}$ are the local means, standard deviations
and cross-covariance. From these quantities it can be shown that
the SSIM index is:

$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

4 Experimental results 


In this section we show the most important results obtained. We have 
compiled those experiments in which an improvement of graphene 
over graphite is detected, for each method explained above with red 
and white light sources. First we show the results for red light for each 
technique, and then those obtained with white light. 

As we can observe, both numerically by means of SSIM and visually,
the fourth image, panel (d) on the right, adapted with graphene, is the
one that most resembles the reference picture, improving on both the
image perturbed with the diffuser and the image adapted with graphite
(Figures 3 to 8).

For the image series in Figures 3 to 5, obtained with red light, we
observe that the SSIM values achieved for the adapted versus perturbed
images present larger differences than for the white light case
(Figures 6 to 8), especially for method 1.


Figure 3: Results of method 1 for red light. (a) Reference image, 10 ml distilled water. (b)
Disturbed image, 30 µl polystyrene, SSIM = 0.5497. (c) Image with 0.4 mg graphite,
SSIM = 0.6818. (d) Image with 0.4 mg graphene, SSIM = 0.7849.

Figure 4: Results of method 2 for red light. (a) Reference image, 10 ml distilled water. (b)
Disturbed image, 50 µl polystyrene, SSIM = 0.4105. (c) Image with 3.3 mg graphite,
SSIM = 0.6629. (d) Image with 3.3 mg graphene, SSIM = 0.7107.

Figure 5: Results of method 3 for red light. (a) Reference image, 10 ml distilled water. (b)
Disturbed image, 70 µl polystyrene, SSIM = 0.5868. (c) Image with 0.4 mg graphite,
SSIM = 0.9717. (d) Image with 0.4 mg graphene, SSIM = 0.9968.




The results obtained using white light for the three methods are
shown below.

Figure 6: Results of method 1 for white light. (a) Reference image, 10 ml distilled water. (b)
Disturbed image, 70 µl polystyrene, SSIM = 0.9762. (c) Image with 0.6 mg graphite,
SSIM = 0.9918. (d) Image with 0.6 mg graphene, SSIM = 0.9920.

Figure 7: Results of method 2 for white light. (a) Reference image, 10 ml distilled water. (b)
Disturbed image, 50 µl polystyrene, SSIM = 0.5815. (c) Image with 3.3 mg graphite,
SSIM = 0.7384. (d) Image with 3.3 mg graphene, SSIM = 0.7736.

Figure 8: Results of method 3 for white light. (a) Reference image, 10 ml distilled water. (b)
Disturbed image, 70 µl polystyrene, SSIM = 0.4525. (c) Image with 0.6 mg graphite,
SSIM = 0.9918. (d) Image with 0.6 mg graphene, SSIM = 0.9920.


5 Conclusion


In this paper we have presented the image enhancement by the ab- 
sorption technique using graphene as an absorber. Likewise, a com- 
parison between the enhancement obtained by graphene and graphite, 
and white and red light, has been made. 

We have found that, in most cases, for the type 2 suspension and
red light, the concentration at which the best SSIM values are achieved
for graphene is 0.4 mg. We have also found that, in the case of vision
loss due to image intensity saturation, there is a general improvement
when introducing both the polystyrene nanospheres and the two
absorbers. We expected significant results in which the enhancement
would be visible to the naked eye for any of the three methods;
however, for method 1 the improvements are practically negligible and
not visually noticeable. We found the most important results for
method 3 and the least remarkable for method 1. We found more
significant results with red light than with white light. In addition,
white light saturates sooner.

Abstract Anomaly detection with machine learning in industrial
inspection systems for manufactured products relies on labelled
data. This raises the question of how the labelling by humans should
be conducted. We consider the case where we want to optimise
the cost of the combined inspection process done by humans and
an algorithm. This also influences the combined performance of
the trained model as well as the knowledge of the performance
of this model. We focus on so-called one-class classification
models, which produce a continuous outlier score. We establish
a cost model for the combined human and machine inspection
of samples. We then discuss, within this cost model, how to select
two optimal boundaries of the outlier score between which human
inspection takes place. We also frame this established knowledge
into an applicable algorithm.


Keywords Mathematical methods and models, artificial intelli- 
gence and machine learning, quality control 


1 Introduction 


The detection of uncommon patterns in a batch of samples is a strong
point of human visual cognition. Still, there are many known limitations
to human visual inspection, as well as cost issues in real-world
production systems. The training of machine learning models for anomaly
detection in industrial inspection problems is often done as a one-class
classification problem where only good samples are presented to the
algorithm. The background for this is that it is in general easy to
acquire good samples but difficult and expensive to find anomalous
samples. A dataset for benchmarking this type of algorithm is the MVTec
dataset [1] [2]. The best performing model³ on this dataset is
“Patchcore” [3]. For a given picture sample, a trained “Patchcore” model
produces an outlier score together with a heat map of the likelihood of
each area being anomalous. This is done by performing outlier detection
on the deep features that a pretrained neural network extracts from the
images. The cutoff values for an anomaly in the outlier score of
“Patchcore” are optimised in the paper by finding the cutoff value with
the highest F1-score. This already assumes that there are known
outliers, which are potentially very costly to acquire. Although we
think of models designed for the MVTec dataset like “Patchcore” as the
main application, our method of finding two boundaries for the outlier
score, between which human inspection will take place, will work for
any model of a one-class classification problem [4] with a continuous
score.

More precisely, in this paper we formulate the problem of the optimal
use of human inspection after acquiring initial data for training.
For this we assume that there are certain costs for inspection and costs
for falsely classified samples. We are not aware that such a
human-in-the-loop machine learning consideration exists in the
literature, although more generic considerations about iterative
machine teaching and active learning can be found in [5]. A similar
process, giving the human some sort of optimal presentation of data for
labelling, was described in [6]. However, this method is not applicable
to the one-class outlier classification problems on images we consider
here. In [7] it is shown that for one-class classification models one
can train an additional model on the bad samples and use a combined
score of the good and bad sample models to find the most promising new
samples for labelling. The authors show that, using one of their active
learning methods, one can achieve faster convergence and better overall
performance of the model. We refer to Munro’s book [8] for a general
overview of human-in-the-loop machine learning.

Another important concept which we will discuss and use is that
of probabilistic classifiers. Probabilistic classifiers are classifiers
which output a probability distribution over the target classes instead
of just a score. Model calibration is a technique that makes a
classifier produce a probabilistic output [9] [10]. A calibrated
one-class classifier outputs a probability p which represents the
probability of being in the one class. In safety-critical applications
it is important to have an idea of the uncertainty of the model. Hence a
probabilistic output is of great help with regard to such problems. Even
in situations which are just cost-critical, we will show that we can
exploit an uncertainty estimate of the classifier for a given sample to
make better decisions.

3 https://paperswithcode.com/sota/anomaly-detection-on-mvtec-ad


2 Model 


In this section we will describe the necessary pre-conditions and cost 
assumptions. Further we describe how, after initial training of our 
one-class classifier, we can establish our first optimal boundaries. We
describe multiple alternatives here. Then we pass on to acquiring
more knowledge about the outliers we will encounter and their outlier 
scores. This will then be used to establish optimal decisions for the 
cutoff parameters of human inspection in the sense of our pre-made 
cost assumptions. 


2.1 Pre-conditions 


First we introduce a few more preliminary and formal assumptions
and notations. We assume that there exists a set $I$ of images, or more
generally data, each of which has a hidden label in $\{0,1\}$, where
images with label 0 are good samples and images with label 1 are
anomalous samples. We will observe these samples one after another in
some process such as an industrial inspection task. For our cost
considerations we assume that the process of labelling a sample by a
human has a cost $c_l$ associated with it. Further we assume that human
labelling perfectly assigns the correct label to the data. With $N$
initially labelled data points we train and test a model $M$ which will
then produce an outlier score $M(i) \in \mathbb{R}$ for every (new)
image $i$ we observe. We set a lower and an upper decision boundary for
manual inspection, $b_l$ and $b_u$, such that any image $i$ with outlier
score $M(i)$, where $b_l < M(i) < b_u$ holds, will be inspected by a
human.


2.2 A priori cost and anomalous data


For our cost considerations we further assume that there is a known
(possibly non-linear⁴) cost function $C_f$ such that the absolute cost of
missed outliers can be calculated as $C_f(\mathrm{FOR}) \cdot K$, where FOR is the
false omission rate, i.e. the percentage of anomalies in the accepted
samples, and $K$ is the absolute number of accepted samples. False
positive samples are associated with a cost per sample of $c_r$. This
could be, for example, lost revenue and disposal costs of an
unnecessarily disposed sample in good condition.


2.3 Initial cut-off boundaries 


We assume now that the initial sampling and labelling of data $D$ and
the training of a model $M$ have been conducted. We update our initial
belief $p_0$ of the outlier percentage by taking the percentage of
outliers in the sampled $D$ into account. We are now interested in
finding optimal cutoff parameters $b_l$, $b_u$ at this stage. We discuss
multiple alternatives now.


A priori anomaly distribution 


In the first case we assume that the distributions of the outlier score
of samples with label 0 and of samples with label 1 are both Gaussian⁵.
For the good samples we can directly estimate this distribution.
We get some distribution $g_g$ with mean $\mu_g$ and variance $\sigma_g^2$. For the
bad samples we also get some Gaussian distribution $g_b$. In the case
where there are no bad samples available, we take some initial belief
about the distribution, which we could take from former observations
such as the MVTec dataset or a similar product line (see Figure 1), as
our distribution. We can then find the optimal parameters $b_l$, $b_u$ in
terms of cost. In order to find these parameters one would minimise
Equation 1 of Section 2.4.


4 One reason for non-linearity could be reputation costs, i.e., due to network effects 
reputation falls non-linearly with increasing fault-rate. 
5 A non-Gaussian distribution could also easily be considered here. 

(a) Hazelnut

Figure 1: Gaussian distributions of anomaly scores for different items from
the MVTec dataset. Blue represents the good sample distribution and red
represents the bad sample distribution. The anomaly scores stem from a
Patchcore model [3] trained on the training split of the MVTec dataset.
The anomaly score output of the trained model on the good and bad
samples of the test split was then used to fit the Gaussian
distributions shown. On these datasets the model has an AUC score of
0.9996 for Hazelnut, 1.0 for Bottle and 1.0 for Leather on the test
samples.


Optimal cut-off sigma


Another approach would be to refrain from defining an a priori
distribution $g_b$ and instead take a cutoff parameter $x$ such that any
sample with an outlier score higher than $\mu_g + x \cdot \sigma_g$ is considered
anomalous. The choice of the parameter $x$ can be made as follows. We
assume that we cannot inspect every piece which we observe but only some
percentage $p_i$ of them. Hence we have to find $x$ in such a way that the
expected number of samples classified as anomalous is at most the number
that can be handled. Hence we have to pick $x$ such that

$$p_i \geq (1 - p_0) \int_{\mu_g + x \cdot \sigma_g}^{\infty} g_g + p_0$$

holds. Note that we omitted the expected false negative samples
in our considerations, but we assume that this amount is negligibly
small. In case there is no sample to classify at the moment, we might
pick a random sample. In case we have acquired enough bad samples, we
can infer the distribution $g_b$ or update our initial belief about it. More
details on the belief update of a Gaussian distribution can be found
in [11].
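
As an illustration, the smallest admissible $x$ can be obtained in closed form when $g_g$ is Gaussian, because the tail integral then depends on $x$ only through the standard normal distribution. The following Python sketch rearranges the inequality for $p_i$ accordingly; the function name `cutoff_sigma` and the example values of $p_i$ and $p_0$ are hypothetical.

```python
from statistics import NormalDist

def cutoff_sigma(p_inspect: float, p_outlier: float) -> float:
    """Smallest x with (1 - p0) * P(Z > x) + p0 <= p_i for standard normal Z."""
    if not 0.0 <= p_outlier < p_inspect <= 1.0:
        raise ValueError("need 0 <= p0 < p_i <= 1 for a finite cutoff")
    # Largest tail mass of good samples that the inspection budget allows.
    allowed_tail = (p_inspect - p_outlier) / (1.0 - p_outlier)
    if allowed_tail >= 1.0:
        return float("-inf")  # everything can be inspected
    return NormalDist().inv_cdf(1.0 - allowed_tail)

# Hypothetical budget: inspect 10 % of all pieces, believed outlier rate 1 %.
x = cutoff_sigma(p_inspect=0.10, p_outlier=0.01)
print(x)  # cutoff in units of sigma_g above mu_g
```

The resulting threshold on the outlier score is then $\mu_g + x \cdot \sigma_g$, with $\mu_g$ and $\sigma_g$ estimated from the good samples.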


Calibrated output 


In some cases the model comes with a calibrated probabilistic output.
This roughly means that the output value of the model $M(x)$ is the
probability of being an outlier, i.e. we expect to find $q \cdot 100$ outliers
among 100 samples $i'$ with score $M(i') = q$. With such a calibrated model
we can directly use the model output as our probability. We will not
assume in the following that our model is calibrated, although what
follows should be straightforward to adapt to use this output directly
instead of learning some probability as in the previous paragraph.


Now we have found a priori parameters $b_l$, $b_u$, or just $b_l$ $(= \mu_g + x \cdot \sigma_g)$.
With these we can set up our initial human-in-the-loop process. After
some time we will have enriched our dataset of labelled pieces and can
therefore update our belief about the Gaussian curves $g_g$, $g_b$ as
described in [11], or infer the distributions $g_g$, $g_b$ directly from all
the gathered data. There is a caveat concerning the selection of the
samples: because of our parameters, the selection of the samples is
biased. This either needs to be corrected through enough random samples
or by giving the unlabelled data some pseudo label with a continuous
value greater than 0 and smaller than 1. Additionally, we could use the
gathered data to further improve the model $M$, or respectively re-train
a new $M$ with the new and old data, depending on the algorithm in use.
In any case we now fix some model $M$, some $p_0$ and the Gaussian
distributions $g_g$, $g_b$ associated with it, as well as the gathered data.
Whenever we observe and classify a new sample, we could continue to do a
belief update of our estimated values $p_0$, $g_g$ and $g_b$ and retrain our
model $M$ in order to keep improving it. But we omit such considerations
in the rest of the paper.


2.4 Cost-calculation 


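We calculate the cost associated with some fixed $b_l$ and $b_u$ for the
next samples. We expect to see $p_0$ percent outliers, a value which we
have updated from the observations $D$. Additionally, we can calculate
the expected percentage that the next sample will be a true positive,
true negative, false negative or false positive:

$$\mathrm{TP}(b_l) = p_0 \int_{b_l}^{\infty} g_b, \qquad \mathrm{TN}(b_u) = (1 - p_0) \int_{-\infty}^{b_u} g_g,$$

$$\mathrm{FN}(b_l) = p_0 \int_{-\infty}^{b_l} g_b, \qquad \mathrm{FP}(b_u) = (1 - p_0) \int_{b_u}^{\infty} g_g.$$

From this we can calculate the false omission rate $\mathrm{FOR} = \frac{\mathrm{FN}}{\mathrm{FN} + \mathrm{TN}}$. Now
for the next sample we have the cost function $C(b_l, b_u)$ defined as follows:

$$C_f(\mathrm{FOR}(b_l)) \cdot [\mathrm{TN}(b_u) + \mathrm{FN}(b_l)] + c_r \cdot \mathrm{FP}(b_u) + c_l \cdot \left( (1 - p_0) \int_{b_l}^{b_u} g_g + p_0 \int_{b_l}^{b_u} g_b \right) \qquad (1)$$

This function is our minimisation target, for which we choose $b_l$ and $b_u$
accordingly:

$$\min_{b_l, b_u} C(b_l, b_u, g_b, g_g, p_0) \quad \text{s.t.} \quad b_l < b_u, \quad b_l, b_u \in \overline{\mathbb{R}} \qquad (2)$$

where $\overline{\mathbb{R}}$ is the set of the extended real numbers, which additionally
contains plus and minus infinity, i.e. $\overline{\mathbb{R}} = \mathbb{R} \cup \{-\infty, +\infty\}$.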
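
To illustrate how the cost function (1) and the minimisation problem (2) might be evaluated numerically, the following Python sketch assumes Gaussian score distributions, computes the expected cost per sample and minimises it by a simple grid search. The grid search, the linear choice of $C_f$, all numerical values and the names `expected_cost` and `grid_search` are assumptions made for illustration only and are not taken from the paper. The grid also admits $b_l = b_u$, which represents the case without human inspection.

```python
import math
from statistics import NormalDist

def expected_cost(b_l, b_u, g_g, g_b, p0, cost_f, c_r, c_l):
    """Expected cost per sample of Equation (1) for boundaries b_l <= b_u."""
    tn = (1.0 - p0) * g_g.cdf(b_u)          # good samples that are accepted
    fn = p0 * g_b.cdf(b_l)                  # anomalies accepted without inspection
    fp = (1.0 - p0) * (1.0 - g_g.cdf(b_u))  # good samples rejected automatically
    inspected = ((1.0 - p0) * (g_g.cdf(b_u) - g_g.cdf(b_l))
                 + p0 * (g_b.cdf(b_u) - g_b.cdf(b_l)))
    accepted = tn + fn
    false_omission_rate = fn / accepted if accepted > 0.0 else 0.0
    return cost_f(false_omission_rate) * accepted + c_r * fp + c_l * inspected

def grid_search(candidates, **kwargs):
    """Return (cost, b_l, b_u) minimising the expected cost over a finite grid."""
    best = None
    for i, b_l in enumerate(candidates):
        for b_u in candidates[i:]:
            cost = expected_cost(b_l, b_u, **kwargs)
            if best is None or cost < best[0]:
                best = (cost, b_l, b_u)
    return best

# Hypothetical setting: all numbers below are made up for illustration.
params = dict(
    g_g=NormalDist(mu=1.0, sigma=0.5),   # scores of good samples
    g_b=NormalDist(mu=4.0, sigma=1.0),   # scores of anomalous samples
    p0=0.02,                             # believed outlier rate
    cost_f=lambda rate: 100.0 * rate,    # linear cost of missed outliers
    c_r=1.0,                             # cost of a false positive
    c_l=0.2,                             # cost of one human inspection
)
grid = [-math.inf] + [0.1 * k for k in range(0, 81)] + [math.inf]
cost, b_l, b_u = grid_search(grid, **params)
print(cost, b_l, b_u)
```

A finer grid or a continuous optimiser could replace the exhaustive search; the grid is used here only because it makes the constraint on the two boundaries explicit.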

In the case of a very low outlier rate $p_0$ we can simplify the cost by
setting $b_u = \infty$, and the optimisation problem becomes a single-variable
problem. Often it will be the case that we have a fixed percentage
of images, say $p_i$, which we can inspect, for instance because of a fixed
amount of available human labour. In this case the last term of the
cost function (1), the inspection cost, will be replaced by the constraint

$$p_i = (1 - p_0) \int_{b_l}^{b_u} g_g + p_0 \int_{b_l}^{b_u} g_b$$

If we additionally set $b_u = \infty$, we can already find the optimal $b_l$ by just
using this constraint. But these considerations are still useful, as we can
now estimate the cost of our system and further estimate whether it is
worthwhile to employ or dismiss a human at a certain cost, or estimate the
cost saving for a higher or lower rate of inspection of samples.


2.5 Non independence of outlier observations 


In the case where we believe that the observed data are not
independent⁶, we could increase the believed percentage of outliers
$p_0$ for the next few observed samples after observing an outlier. This
ensures that the costs stay optimal for the next observed samples with
higher anomaly probability. Note that in more complicated production
environments we may observe pieces from multiple different machines.
If possible, one should keep track of the machines a piece went through
to get a more individual assessment of the anomaly probabilities.


3 Algorithm 


In this section we combine the observations established in the last
section into a combined algorithm (see Algorithm 1). As input to
our algorithm there is a one-class classification model $M$ that needs
$N$ samples for initial training and testing, and there is also a belief
about the percentage of outliers $p_0$ in the samples to be observed.
Additionally, we have the cost function $C_f$ and a real value $c_r$
representing the cost of a false positive sample. Moreover, we fix a
number $L$ of outliers we want to observe. The algorithm starts by letting
a human label samples until we receive a set $D$ containing $N$ labelled
samples with label 0. We use this dataset $D$ to train some model $M_D$,
which is then used to produce outlier scores for the test data split of
$D$. This is then used to find more anomalous samples in order to form
a probabilistic model by inferring a Gaussian curve for the good and
a Gaussian curve for the bad samples. With this we are finally able to
find the cost-optimal parameters $b_l$ and $b_u$, which mark the outlier score
interval where human inspection takes place.

6 A broken machine could for example produce a sudden stream of defective parts.


Algorithm 1 Find optimal interval for human inspection

initialization: p_0, C_f, c_r, c_l, N, L
n ← 0
while n < N do
    wait for next sample s
    get label l(s) (by human)
    n ← n + 1 − l(s)
    p_0 ← belief update through observed l(s)
end while
return training dataset D, p_0
M_D ← train model with D
b_l ← (see Section 2.3 for possible computations)
k ← 0
while k < L do
    get next sample s
    if b_l < M_D(s) then
        get label l(s) (by human)
    end if
    k ← k + l(s)
    p_0 ← belief update through observed l(s)
end while
return updated dataset D, p_0
g_g, g_b ← infer Gaussians from data D
solve min over b_l, b_u of C(b_l, b_u, g_b, g_g, p_0)
return model M_D and inspection interval values b_l, b_u
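
The inference step near the end of Algorithm 1 and the resulting decision rule can be sketched in Python as follows. The sketch fits one Gaussian per class to labelled outlier scores and then routes a new score using the two boundaries. The function names, the example scores and the boundary values are hypothetical; the boundaries are set by hand here instead of by solving the minimisation problem, and the estimate of the outlier rate is the plain fraction of labelled anomalies, which ignores the selection bias discussed in Section 2.3.

```python
from statistics import NormalDist

def infer_gaussians(scores, labels):
    """Fit g_g (label 0) and g_b (label 1) and estimate the outlier rate p0."""
    good = [s for s, l in zip(scores, labels) if l == 0]
    bad = [s for s, l in zip(scores, labels) if l == 1]
    if len(good) < 2 or len(bad) < 2:
        raise ValueError("need at least two samples per class")
    g_g = NormalDist.from_samples(good)
    g_b = NormalDist.from_samples(bad)
    p0 = len(bad) / len(labels)
    return g_g, g_b, p0

def route(score, b_l, b_u):
    """Accept at or below b_l, reject at or above b_u, otherwise ask a human."""
    if score <= b_l:
        return "accept"
    if score >= b_u:
        return "reject"
    return "human inspection"

# Hypothetical labelled outlier scores (label 1 marks an anomaly).
scores = [0.8, 1.1, 0.9, 1.3, 1.0, 3.9, 4.4, 1.2, 0.7, 3.6]
labels = [0, 0, 0, 0, 0, 1, 1, 0, 0, 1]
g_g, g_b, p0 = infer_gaussians(scores, labels)
print(g_g.mean, g_b.mean, p0)
print([route(s, b_l=1.5, b_u=3.0) for s in (1.0, 2.2, 4.1)])
```

Only scores strictly between the two boundaries are sent to a human, matching the condition $b_l < M(i) < b_u$ from Section 2.1.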


4 Discussion and future work


We establish a theory for the cost-optimal selection of samples for
one-class classification models. For this we established a cost model and
showed how to infer probabilistic knowledge of the samples online and 
offline in order to establish a cost-optimal decision for a human inspec- 
tion boundary in the outlier score. Moreover, we have merged this into 
an algorithm which can be applied in production. For now we have not 
considered the case of retraining the model and we can assume that this 
will be done occasionally till the economic evaluation stabilises or the 
performance is satisfactory. Also the problem of a temporal dependence
of the occurrence of outliers which could stem from faulty machines 
was discussed. At worst there could be no outlier samples or only 
a very biased selection of them. A detailed analysis of the practical 
relevance of this problem could be an interesting topic for future inves- 
tigation. There could also be potential for future work especially in the 
case where the one-class problem is a moving target, i.e. the golden 
sample changes over time. The case for selecting valuable examples 
for improving the model performance also seems an interesting area 
not yet considered and will probably require an extra model which is 
also trained with the outliers. Another not yet used feature is utilising 
the presentation of anomalous areas on the image for better outlier vi- 
sualisation for the user decision. There, another optimisation problem 
arises which is the optimisation of the cutoff parameter for the selec- 
tion of the anomalous area. A more general question is the question of 
a good visualisation to improve human performance. 

Abstract 2D or 3D sensor technology can be used for data acquisition
to monitor the weld quality during laser welding. Compared
to a 2D camera image, the 3D height data contains additional
relevant information for quality inspection. However, the
disadvantages are system complexity, higher costs, and longer
acquisition times. Therefore, we compare two image-based
methods with the quality assessment based on height data. The
first method uses feature vectors of coaxially acquired grayscale
images. The significant advantage is that a camera is often
integrated into the laser system, so no additional hardware is
required. In the second approach, we use an AI-based single-view
3D reconstruction method. The height profile is calculated from 
a camera image and used for further quality assessment. Thus, 
we combine the advantages of 2D data acquisition with higher 
accuracy in evaluating 3D data. In this paper, we analyze a 
dataset of welded hairpins with different defect types and com- 
pare the quality assessment using the height data acquired with 
OCT, the feature vectors from the camera images, and the recon- 
structed height data. 


Keywords Laser welding, hairpin, quality assurance, OCT, 
stacked dilated U-Net (SDU-Net), 3D reconstruction 


1 Introduction 


With the substantial increase in automation of industrial production
lines, reliable and also automated quality control is essential. Laser
welding processes are a key technology for many industrial applications
and must fulfill high quality requirements [1]. However, various
influencing factors can lead to defects in the weld seam, which can
impair the quality and functionality of the product and result in
safety-relevant defects [2,3]. Therefore, companies apply strict
criteria for welding quality.

An increasingly important application with high quality requirements
for laser welding comes from e-mobility. E-mobility will become
more and more prevalent in individual transportation in the future.
This is why vehicle designs and various components are constantly
refined and optimized. For the new generation of motors, automotive
manufacturers increasingly use stators with so-called hairpin technol- 
ogy. The conventional copper windings in the stator of an electric mo- 
tor are replaced by thick copper rods that are welded together, which 
saves space and improves the efficiency of an electric motor. Depend- 
ing on the motor design, between 160 and 220 pairs of copper bars are 
inserted into the sheet metal stacks of a stator, and the ends are con- 
nected, usually by laser welding [4-6]. To ensure the high quality of the 
entire stator, each weld must be checked for defects [5,7]. Different
properties and measured variables can be used to evaluate the qual- 
ity of the weld seam [7,8]. Various works show that the evaluation of 
three-dimensional data provides higher accuracy than the analysis of 
two-dimensional camera images [8-10]. The disadvantages are higher 
hardware costs, system complexity, and longer process times. 

This work presents an approach that computes the height map from 
a camera image instead of acquiring it with a 3D sensor. This proce- 
dure allows us to use the height data for quality assessment without 
the disadvantages mentioned above. We perform the 3D reconstruc- 
tion algorithm using a convolutional neural network [11]. The rest 
of this paper is organized as follows: Section 2 discusses the state of 
research in welding quality evaluation of hairpins and using a 3D re- 
construction algorithm. Section 3 describes the experimental setup and 
investigations of the generated dataset. Building on this, section 4 in- 
troduces different approaches for predicting the hairpin quality from 
image data, 3D data, and reconstructed 3D data. In section 5, the re- 
sults are discussed before section 6 provides a summary as well as an 
outlook for future research activities.


2 Related work


There are a variety of quality monitoring and control systems for laser 
welding. The use of machine learning (ML) methods is evaluated in 
[12] and [7]. Unlike in many ML applications, the number of data samples
in the industrial environment, especially in research, is limited, and the
computing time must not extend the production time [13].

In [14], a post-inspection of laser welds is performed based on images
using semantic segmentation. Here, a tiny network structure is used
for the reasons just mentioned. [7] uses images from three perspectives
(front, top and back) to evaluate the seam quality of hairpins. More
information about the seam connection can be obtained through the 
different views. However, integration into a production line is more 
complex because it is often difficult to attach cameras to the side. The 
resulting accuracy of the network is in the range from 61% to 92% 
[7]. [8] analyze and compare different convolutional neural networks
(CNNs) to perform post-process quality control of hairpins. In addition
to 2D grayscale images, 3D scans are used as input to the CNN. Based 
on the 3D scans, the classification accuracy is higher than using the 
2D images. This result supports the assumption that the height values 
contain relevant information for quality assessment. In [15] and [10], a 
height profile is also used to determine weld quality in laser welding. 
Especially in hairpin welding, the height difference between the pair 
of hairpins before and after welding provides information about the 
volume of the molten material. This volume, together with the other 
measured parameters of the surface profile of the weld, is crucial for 
the welding quality of the hairpins [9]. 

Due to the cost, higher system complexity and acquisition time, it 
is advantageous to calculate the height profile using a method of 3D 
reconstruction. [16] use shape from shading (SFS) to perform a 3D re- 
construction of a weld seam. Based on the curvature features, the weld 
quality is evaluated. For the classification of welds with complex
structures and characteristics, however, the curvature feature
contains limited information and cannot be applied to this task.
The SFS algorithm reconstructs a shape based on shading variation, as- 
suming a single point light source and Lambertian surface reflectance, 
where the brightness of an image pixel depends on the light source 
direction and the surface normal.


Figure 1: Welding results of hairpins. (a) no weld, (b) good weld, (c) pin not in the focus
of the laser, (d) weld with too low power, (e) misaligned pin pair, (f) insulated
copper rods.


Due to the hairpins’ height and the welding bead’s dome,
a reconstruction from a single image with SFS
is impossible since the incidence of light can only be realized on one 
side and the other side is accordingly in shadow. [17] calculates a 3D re- 
construction from several images taken with different relative positions 
between camera and weld during the data acquisition phase. Based on 
the resulting 3D model, a quality evaluation of the weld is performed. 

Deep learning-based methods for 3D reconstruction have shown 
promising results in various research fields. While classical methods 
deal with shape and image properties such as reflection, albedo, or 
light distributions, deep learning-based methods use complex network 
architectures to learn the correlations between 2D and 3D data. Many 
approaches are challenging to integrate into existing industrial pro- 
cesses because new cameras or illumination equipment are required. 
[11] compare different single-image reconstruction methods on an in- 
dustrial dataset. In their investigations, a variation of the U-Net, the 
stacked dilated U-Net (SDU-Net), has prevailed with its performance. 


3 Material 


Laser-welded pairs of copper pins, as shown in Figure 1, are used for 
data acquisition. Different welding results are recorded to obtain a 
representative data set that includes error cases. Data from 953 hairpins 
were acquired from a position above the pins, as this perspective allows 
the integration with the existing industrial process. The 2D intensity 
images of the hairpins were captured using a Baumer VCXG-15M.I 
industrial camera based on CMOS technology.


Figure 2: OCT scans at different positions (left side, center, right side). Top row -
similar height values indicate a good seam. Bottom row - the different heights
indicate the fault case (misaligned pin pair).


An optical coherence tomography (OCT) scanner from Lessmüller Lasertechnik
is used to capture the 3D data. Many line scans are performed to obtain the
height maps of the entire weld. These are then combined to create an 
overall height map of the component. The exact structure of the data 
acquisition and the assignment of the camera data to the height data is 
explained in detail in [11]. To reflect the real situation in the industry 
with low data availability, we use 10% of the data, i.e., 95 samples, for 
algorithm development. The other 90%, i.e., 858 samples, are used for 
testing and evaluation. 


4 Detection of weld quality 


To compare the results of the quality assessments, we analyze various input
data for the weld inspection. We use the height data acquired by the 
OCT, camera images, and reconstructed height data to create feature 
vectors. 


4.1 Height data acquired with OCT 


The OCT sensor measures the relative height differences within the 
weld seam. Good welding of a pin pair results in a round welding 
bead, which has its maximum in the center. The line scans should have 
a structure like the upper row in Figure 2 over the entire weld bead. 
The bottom row shows the images at the same positions of a weld with 
misaligned pins for comparison.


Figure 3: Difference of the maximum values of the line scans to the center. The maximum
value of each line scan is determined. The difference to the center is
calculated and the values are plotted in a curve. Mathematically this means
$f(i) = |h_c - \max l_i|$, where $l_i$ is the line scan with index $i$ and $h_c$ is the height
value in the center. (a) Good welds result in a curve with its maximum in the
center. (b) Defective welds, such as misaligned pins or pins that are not in the
laser’s focus, can be detected in the curve.


As in [18] and in [19], we compare multiple line scans with each other.
For higher accuracy, we scan the
hairpin in the x- and y-directions with lines at distances of 18 µm.

For quality assessment, we use different criteria. Analogous to [18], 
we consider the difference between the maximum height values of the 
individual line scans to the height of the pin center. Through this 
comparison, we can detect misalignment of the hairpins or misshapen 
welding beads. The procedure is visualized in Figure 3. In addition to 
the curve profile, we evaluate the line scans’ maximum and minimum 
distance to the pin center’s height. If the distance to the pin center is 
too small, the weld is not sufficiently stable. If, on the other hand, the 
minimum distance is too large, this provides information about pores 
or cracks in the pin surface. We also consider the width of the weld 
bead in the evaluation. 
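
The center-difference curve described above, f(i) = |h_c - max l_i|, can be sketched in a few lines. This is only a minimal illustration and not the authors' implementation: the height map is assumed to be a list of line scans with height values in µm, the center height h_c is assumed to be the central sample of the central line scan, and the function name is hypothetical.

```python
def center_difference_curve(height_map):
    """Return f(i) = |h_c - max(l_i)| for every line scan l_i of the height map."""
    center_scan = height_map[len(height_map) // 2]
    h_c = center_scan[len(center_scan) // 2]  # assumed definition of the center height
    return [abs(h_c - max(scan)) for scan in height_map]


# Toy dome (values in µm): highest in the middle, falling off towards the edges.
dome = [
    [0, 100, 200, 100, 0],
    [100, 300, 400, 300, 100],
    [200, 400, 500, 400, 200],
    [100, 300, 400, 300, 100],
    [0, 100, 200, 100, 0],
]
curve = center_difference_curve(dome)
print(curve)
```

For a symmetric bead such as this toy dome, the curve is symmetric around the central line scan; a one-sided offset between the pins would show up as an asymmetric curve.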


4.2 Camera images 


As mentioned earlier, it is not always possible to capture the height 
profile due to time constraints and the increasing cost and complexity 
of the system.


Figure 4: Detection of the welded and unwelded pin surface in the camera image. The
detection of the surface of the weld, as well as the unwelded pins, is shown. In
each case, the right image shows the binary mask overlaid on the image (green
- weld, red - unwelded pin). (a) good weld, (b) misaligned pin pair, (c) pin not
in the focus of the laser, (d) insulated copper rods.


Therefore, we develop a different approach by deriving
the quality-relevant properties of the weld from the grayscale image.
As with OCT scans, we can also infer the width of the weld from the 
grayscale image. In addition, the size of the weld surface provides 
information about the stability of the weld. We can also detect this 
size in 2D images. For the detection of the seam area, threshold-based 
methods reach their limits due to the low intensity differences and contrasts
in the images. However, CNN-based semantic segmentation can
detect the area well, even in small network architectures. Analogous 
to [6], we train a small SDU-Net to detect both the welded seam and the 
non-welded pin regions. The predicted masks are shown in an overlay 
representation in Figure 4. 

We can already detect many defect cases by evaluating the width of 
the weld and the size of the two classified areas. As a further evaluation,
we analyze the shape of the weld. In good welds, this is approximately
circular and has no pronounced corners or edges. However, if too
little material is melted during welding, no round weld bead is formed, 
and the contour is slightly angular due to the pin shape. Other defects, 
such as copper pins that have not had their insulation stripped, also 
result in edges in the weld shape. Since the weld surface is a closed 
contour, Fourier descriptors can be used to characterize it. Analogous 
to [20], we compute the Fourier descriptors of the contours. An eval- 
uation of the harmonics considers the complexity of the contour. In 
particular, in combination with the information about the size of the 
non-welded pin region, this contains information about insufficiently 
welded pin pairs. The relationship between the defined features and 
the evaluation result of the seam quality based on the height profile is 
shown in Figure 5.


Figure 5: Quality-related features derived from the grayscale images. The correlation
of the features derived from the 2D image with the seam quality based on the
height profile is shown (GW - good weld, DW - defective weld).
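
The Fourier-descriptor evaluation of the weld contour from Section 4.2 can be illustrated with a small sketch. It is not the method of [20] or the authors' code: the contour is assumed to be a list of (x, y) points ordered counterclockwise, the descriptors are computed with a plain discrete Fourier transform, and `harmonic_energy_ratio` is a hypothetical measure of how much of the contour energy lies outside the fundamental harmonic.

```python
import cmath
import math


def fourier_descriptors(contour):
    """DFT of a closed contour whose (x, y) points are treated as complex numbers."""
    z = [complex(x, y) for x, y in contour]
    n = len(z)
    return [
        sum(z[k] * cmath.exp(-2j * math.pi * u * k / n) for k in range(n)) / n
        for u in range(n)
    ]


def harmonic_energy_ratio(contour):
    """Share of the non-DC energy outside the fundamental harmonic (u = 1)."""
    coeffs = fourier_descriptors(contour)
    energies = [abs(c) ** 2 for c in coeffs[1:]]  # skip u = 0 (contour position)
    return 1.0 - energies[0] / sum(energies)


# A sampled circle and a coarse square outline, both ordered counterclockwise.
circle = [(math.cos(2 * math.pi * k / 64), math.sin(2 * math.pi * k / 64)) for k in range(64)]
square = [(1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0)]

print(harmonic_energy_ratio(circle))
print(harmonic_energy_ratio(square))
```

A circular contour is described by the fundamental harmonic alone, so its ratio is close to zero, whereas the angular contour needs higher harmonics. Any threshold on such a value would have to be chosen from labeled data.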


4.3 Height data from the 3D reconstruction algorithm 


In the third approach, we use an AI-based single-view reconstruction
method. Thus, we combine the advantages of the two methods just
presented. This approach calculates the height profile from the cap- 
tured camera image. For this purpose, only one camera image must be 
taken in the production line, and the algorithm can replace the time- 
consuming OCT scan. Further analyses can still be performed on the 
more informative height profile. We use a modified SDU-Net architec- 
ture for the reconstruction. Since the model is tiny, with only 162,423 
parameters, it can also be executed efficiently on industrial hardware. 
The exact implementation, the training parameters and the result anal- 
ysis with deviations from ground truth are explained in detail in [11]. 


5 Results and discussion 


The quality assessment of the 858 test samples is performed separately 
with each method to evaluate the different approaches. The ground 
truth is the division into good weld (GW) and defective weld (DW) 
based on the features derived from the entire recorded height map 
using OCT. We evaluate the quality assessment based on the criteria 
visible in the camera image (Cam) and the AI-based 3D reconstruction
(3D-R) data. When height data is used for quality assessment, only a 
few line scans are usually acquired due to time constraints. The scanner 
made by Lessmüller Lasertechnik has a scan frequency of 70 kHz, so
a scan of the entire component takes considerable time. Therefore,
we use an approach in which only six OCT scan lines (three in the x-direction
and three in the y-direction) are considered in the evaluation
(6L). In each direction, one scan is in the center of the weld, and the other two
lie one on each side. In another investigation, we only consider the three scans in the
x-direction in our evaluation (3L). The feature vectors for the quality 
assessment are defined based on those of the entire height map. The 
results are presented in Figure 6 using a confusion matrix. 

The AI-based 3D reconstruction using the camera images gives the
best results of the four methods compared. 842 of the 858 test samples 
are classified in the same way as with the ground truth data, even if 
only the camera image was used as input. The discrepancies are due 
to borderline cases. As described in detail in [11], the model trained 
on 95 images has an average deviation of 93.5 µm from the ground
truth. Due to the rule-based partitioning into GW and DW, in case of
doubt, a deviation of one pixel value may yield a different result.
One pixel value corresponds to a deviation of 46.8 µm in height and
a difference of 18 µm in width. The borderline cases are welds where
the width or the minimum height of the weld bead was barely reached 
with one method and just missed with the other. 
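
The figures quoted in this paragraph can be put into relation with a short calculation. The numbers are taken from the text; the variable names are only illustrative.

```python
total_test_samples = 858
matching_3d_r = 842  # 3D-R samples classified like the ground truth

agreement = matching_3d_r / total_test_samples
mismatches = total_test_samples - matching_3d_r

mean_deviation_um = 93.5  # average reconstruction deviation reported in [11]
height_step_um = 46.8     # height change represented by one pixel value
deviation_in_steps = mean_deviation_um / height_step_um

print(f"agreement: {agreement:.1%}, mismatches: {mismatches}")
print(f"mean deviation in pixel values: {deviation_in_steps:.2f}")
```

The reconstruction therefore agrees with the ground truth for roughly 98% of the test samples, and the average deviation corresponds to about two pixel values in height, which is consistent with borderline cases flipping between GW and DW.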

When evaluating the results based on the camera images, it is notice- 
able that more pin pairs with height offset were detected as GW. This 
wrong classification can be attributed to the fact that the height offset is 
not considered in any of the used image-based classification features. 
The offset cannot be identified by the shape, size of the weld bead or 
the area of the unwelded pin surface. Therefore, this error case unfor- 
tunately often remains undetected. On the other hand, samples that are 
incorrectly classified as DW can be attributed to tiny weld beads. If less 
material was melted during the process, the welds often have a rather 
rectangular shape due to the pin shape. In some cases, the height of the 
weld is sufficient to create a stable weld, although it still has an edged 
shape. Based on the camera image, these samples are classified as DW 
because they look very similar to the unstable low-power welds. GWs 
with a round weld bead are reliably detected as GWs. 

The evaluation with a few line scans also shows more deviating re- 
sults than the evaluation with 3D reconstruction. In addition to bor- 
derline cases, these methods incorrectly classify pin pairs in which one 
of the pins was only partially connected or welds with spatter as GW. 
Especially when evaluating with only three scans in the x-direction, 
insufficiently welded pins (e.g. Figure 1(c, d)) were missed more often.


Figure 6: Comparison of the results of the different methods. The results of the approaches
camera image (Cam), AI-based 3D reconstruction (3D-R), six line
scans OCT (6L) and three line scans OCT (3L) are compared with ground truth
based on the features from the entire height map.


6 Conclusion 


We have developed and compared different methods for quality as- 
sessment in hairpin welding. In addition to analyzing the acquired 
height profile, we have successfully determined the quality based on a 
grayscale image. For the image-based evaluation, we used two differ- 
ent approaches. First, we used features derived from the image, such 
as the width and shape of the weld, to perform a quality assessment. 
The most significant deficiency concerned pin pairs that have an offset
between the pins. This misalignment is not captured in the image-based
features and, thus, is not considered in the quality assessment.
With this approach, the misalignment would have to be checked and 
corrected before welding, completely avoiding the faulty weld. The 
significant advantage of using the image-based features is that no ad- 
ditional height scanner is needed, which reduces cost, setup effort, and 
acquisition time and allows quality analysis through a software up- 
date. The calculation of the binary mask following the approach of [6] 
only requires 16 ms on an i5-7300U CPU. It can be integrated into the 
process with the subsequent algorithmic evaluation without additional 
hardware requirements. In a second approach, we performed an AI-based
3D reconstruction on a single grayscale image and then used
the computed height data for quality assessment. With this approach, 
we achieved higher accuracy and could correctly assign the test sam- 
ples, except for some borderline cases. The approach presented in [11] 
allows reconstruction based on a single grayscale image. For this pur- 
pose, a small SDU-Net architecture is used, which can be executed on 
an i5-7300U CPU in only 45 ms. This method opens up a new possibility
for quality evaluation. Unlike feature-based evaluation of the
camera image, a height scanner is required to train the AI model. Afterward,
however, only one camera image is needed in the productive
system, and the time for the height scan can be saved.

In future work, we will integrate the developed solutions into the 
manufacturing process and evaluate the results on other components 
than hairpins. In addition, the robustness and transferability of an AI
model for calculating the height profile between different plants will be 
further investigated. Depending on the results, it might be necessary to 
improve the networks or the algorithms used downstream for quality 
assessment.


Abstract In this research, we investigate possibilities to train
convolutional neural networks with a small dataset for semantic 
segmentation, while achieving the best possible model generalization.
In particular, we want to segment corrosion on the surface
of industrial objects. In order to achieve model generalization,
we utilize a selection of established and advanced strategies,
e.g. self-supervised learning. Besides radiometric and
geometric data augmentation, we focus on model complexity
regarding encoder and decoder, as well as optimal pretraining.
Finally, we evaluate the best performing model against
a pixel-wise random forest classification. As a result, we achieve
an F1-score of 0.79 for the best performing model regarding the
segmentation of corrosion. 


Keywords Semantic segmentation, classification, machine vi- 
sion, surface inspection, corrosion detection, quality assurance 


1 Introduction 


In the field of machine vision (MV), image segmentation techniques 
are heavily utilized for the surface inspection of industrial objects [1]. 
Image segmentation leads to image regions that can represent image 
texture in a geometrically precise manner. Well-established segmentation
methods like thresholding, clustering or region growing, however,
have the disadvantage of lacking semantic information. Newer
deep-learning-based (DL) segmentation methods based on convolutional
neural networks (CNNs) are capable of adding semantics implicitly
within the training process. These methods are often based on
fully convolutional networks (FCNs), which only consist of convolution
layers as learnable layers, besides optional batch normalization. FCNs
can be viewed as functions that map an input image to a map of $n \in C$
scores per pixel, where $C$ denotes a set of class labels. By applying an
argmax function, the most likely class $c$ is chosen for a particular pixel.
While DL-based models outperform pre-DL methods on large datasets, 
the downside of such models is the potential of overfitting due to the 
large number of model parameters. In many practical applications,
however, no adequate amount of data is available [2]. Among other
applications, common MV tasks in the area of surface inspection lack 
a sufficient amount of data in order to train a DL-based model to gen- 
eralize well. Recent advances in DL research target the challenges of 
small training datasets. 
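
The argmax step mentioned above, which turns the per-pixel class scores of an FCN into a label map, can be sketched as follows. The nested-list layout `scores[c][y][x]` and the function name are assumptions made for this illustration; a real pipeline would operate on tensors.

```python
def argmax_segmentation(scores):
    """scores[c][y][x] is the score of class c at pixel (y, x); returns a label map."""
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Scores of three classes for a 2 x 2 image.
scores = [
    [[0.9, 0.2], [0.1, 0.3]],  # class 0
    [[0.1, 0.7], [0.2, 0.3]],  # class 1
    [[0.0, 0.1], [0.7, 0.4]],  # class 2
]
label_map = argmax_segmentation(scores)
print(label_map)
```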

This work aims to utilize a selection of these advanced learning strate- 
gies as well as established methods in order to approximate the best 
possible model generalization. Our scenario includes a barrel as it is 
used for the storage of low-level radioactive waste (Figure 1(a)), which we
from now on refer to as our object. The training set consists of an 
RGB image $I_{Train}$ of the unwrapped coat of the object (Figure 1(b)),
whereas the test set consists of an RGB image $I_{Test}$ of the bottom. Both
sets are labeled to separate the image pixels into eight classes. In our
previous work [3], we already utilized the coat for training and also
testing, though both datasets were from different areas of the coat and
therefore disjoint from each other. In this new work, however, $I_{Test}$ is
acquired under different illumination conditions, which sets $I_{Train}$
and $I_{Test}$ even further apart from each other regarding the image
characteristics. For our scenario, we exclusively use $I_{Train}$ and no additional
image datasets or unlabeled data for training. We merely use $I_{Train}$
without labels within a model training at some point in this work. To
train a model, $I_{Train}$ is split up into smaller image patches for model
input. We employ established and widely used data augmentation 
(Section 3.1) techniques by applying geometric and radiometric image 
transformations. Another aspect of our work is encoder pretraining 
(Section 3.3). For this purpose, we train the models with randomly initialized
model parameters according to some normal distribution and
with ImageNet-pretrained parameters. As a third encoder pretraining
strategy, we employ self-supervised learning (SSL) [4]. For measuring
the impact of model complexity, we undertake a model selection
(Section 3.2). To this end, we use two encoders with different depths from
the same model family: ResNet18 and ResNet50 [5]. The accompanying
decoder architectures are U-net [6] and DeepLabv3 [7]. The stated
techniques are stages in a training pipeline, where the best performing 
technique per stage is chosen. For comparison, we also employ
a pre-DL algorithm in order to compare the results of both learning
domains. A random forest (RF) classifier [8] (Section 3.4) is therefore
applied within the RGB feature space. The result is a pixel-wise classi- 
fication without further contextual information. 


Figure 1: Barrel in the test facility (a). Within the facility, the image data of the coat and 
bottom is acquired. Image of the unwrapped coat (b), used for training our 
models. 


2 Related work 


The automated detection of structural damage such as corrosion on in- 
dustrial objects based on image data is an active field of research [9]. 
Specifically, many research efforts are focused on applying end-to-end 
DL to the task of corrosion detection [10]. This, however, poses the chal- 
lenge that DL approaches are typically data-hungry, requiring large 
amounts of training data, while publicly available, labeled datasets for 
corrosion detection are few and far between [11]. Furthermore, the vi- 
sual appearance of corrosion is quite specific w.r.t. the respective target 
materials and shapes and it is still an open research question to what 
extent the recently published dataset from [12] can be transferred to 
specific application scenarios such as the coated steel barrels used in
our work. Thus, while other research addresses the question of how to
alleviate the effort of creating large-scale training datasets for corrosion 
detection, e.g. via crowd-sourcing [13] or efficient labeling tools [14], 
we focus on how to efficiently use small amounts of training data in a 
DL context by evaluating the impact of pretraining methods from the 
fields of SSL in relation to the results of state-of-the-art DL networks 
on a common small-scale dataset.

DL-based corrosion detection can be approached as a classification 
problem, where image regions are classified w.r.t. the presence of corrosion
in a sliding window manner [15]. Sometimes, the results of a
sliding window classification are further post-processed to yield pixel- 
wise segmentation results, e.g. via the activation maps of patches that 
have been classified as containing corrosion [16]. Other works aim at 
detecting corrosion by means of DL-based object detection networks 
such as R-CNNs [17]. Here, first, instance-wise bounding boxes are 
regressed which are subsequently refined to pixel-wise segmentation 
masks. Lastly, as is the case in our work, DL-based corrosion detection 
can be approached as a semantic segmentation task. In [18], differ- 
ent fully convolutional segmentation networks are comparatively eval- 
uated for the task of segmenting corrosion spots on steel structures. 
In [19], fully convolutional segmentation is compared against an ap- 
proach based on R-CNN. As the results are found to not be precise 
enough, they are refined by a contour-aware postprocessing approach. 
Lastly, [20] apply DeepLabv3 in a multi-temporal setting for damage- 
progress monitoring. 


3 Methodology 


In this section, the methods and the utilized datasets are described. 
Two methodological strands are applied: one strand covers the
model pretraining with all possible encoder-decoder combinations.
These models are trained with the baseline dataset
as well as the augmented dataset. The other strand is a
pixel-wise RF classification within the RGB feature space.


3.1 Data augmentation


As mentioned previously, the baseline dataset for training consists of 
102 image patches of size 512 x 512 px. In the data augmentation pro- 
cess, geometric and radiometric transformations are applied to these 
patches. The geometric transformations consist of rotations, as well as 
a combined crop and resize operation. Because CNNs are rotation in- 
variant to only some degree, the distinction between an image patch 
and its rotated variant should have a positive effect on the generaliza- 
tion capability. The second geometric transformation is a combined 
crop and resize operation. A crop of an image is chosen randomly 
and then resized to the original image patch size. The resizing oper- 
ation utilizes a bilinear interpolation. With this combination, we aim 
at creating new appearances of texture, which differ from the original
image patch. Finally, the radiometric transformation consists of a color 
space transformation to HSI, where saturation and intensity are ran- 
domly varied. The image patch then gets transformed back to RGB. 
This strategy is applied to simulate different illumination situations. 
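
Two steps of this section can be made concrete with a small sketch: the number of non-overlapping patches and the combined crop and resize operation. The sketch makes several assumptions that are not stated in the text: incomplete patches at the image border are discarded, the crop size is fixed by the caller, the bilinear interpolation is corner-aligned, and all function names are hypothetical. Images are plain 2D lists to keep the example self-contained.

```python
import random


def patch_grid(height, width, patch=512):
    """Rows and columns of non-overlapping patches; border remainders are discarded."""
    return height // patch, width // patch


def bilinear_resize(img, out_h, out_w):
    """Resize a 2D list of numbers with corner-aligned bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def random_crop_resize(img, crop_h, crop_w, rng=random):
    """Cut a random crop and scale it back to the size of the input image."""
    top = rng.randrange(len(img) - crop_h + 1)
    left = rng.randrange(len(img[0]) - crop_w + 1)
    crop = [row[left:left + crop_w] for row in img[top:top + crop_h]]
    return bilinear_resize(crop, len(img), len(img[0]))


rows, cols = patch_grid(3072, 8763)  # image size given in Section 4 (height x width)
print(rows, cols, rows * cols)
```

Under these assumptions, the 3072 x 8763 px training image yields 6 x 17 = 102 patches of 512 x 512 px, which matches the size of the baseline dataset stated above.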


3.2 Model selection 


The model complexity is one aspect of our investigation. Usually in 
ML, in order to prevent overfitting, one strategy is to reduce the model 
complexity, or to be more specific, the number of model parameters. In 
the case of DL-based models, one possibility to achieve this is to con- 
sider different depths of a model. Another aspect is the selection of a 
decoder, which is responsible for upsampling the learned features to a 
map of classification scores with the size of the original image. 

Encoders. We utilize the ResNet architecture [5] for our investigations. 
This architecture is found quite often in literature as a standard model. 
ResNets are used with different depths. We employ a ResNet18 as 
the small encoder with a rather low complexity. The ResNet50 on the 
other hand is selected as the large encoder, as it contains 50 convolution 
layers. Large encoders have the advantage of learning more distinct 
features in the lower convolution layers, but have more parameters to 
optimize as a disadvantage regarding small training sets. Because the 
surface textures of our object are not very complex, we aim for better 
generalization while not requiring such distinct features by applying
the ResNet18 encoder.

Decoders. For similar reasons as mentioned before, we select the U- 
net and DeepLabv3 architectures as the decoders, as these are com- 
monly used in the domain of semantic segmentation. The U-net 
has a feature preservation aspect to it, because of the so-called skip- 
connections. These skip-connections map the output of a convolution 
layer to its corresponding transposed convolution layer on the decoding
side. The DeepLabv3 architecture applies so-called atrous convolutions
and atrous spatial pyramid pooling. The former is applied to yield a 
more dense feature representation in the upscaling process. The latter 
is applied to include scale invariance to some extent. 
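
The idea behind the atrous (dilated) convolution mentioned above can be shown in one dimension. This is a generic sketch of the operation and not DeepLabv3 code; the function name and the toy signal are chosen for illustration only.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """Valid 1D convolution (cross-correlation form) with dilation `rate`."""
    span = (len(kernel) - 1) * rate  # distance between the first and the last tap
    return [
        sum(kernel[k] * signal[i + k * rate] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

dense = atrous_conv1d(signal, kernel, rate=1)    # receptive field of 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # receptive field of 5 samples
print(dense, dilated)
```

With the same three weights, the dilated variant covers a wider neighborhood, which is why atrous convolutions enlarge the receptive field without adding parameters. Atrous spatial pyramid pooling applies several such rates in parallel.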


3.3 Encoder pretraining 


To pretrain the encoder, we apply three different methods: random
initialization according to a normal distribution, pretraining on the
ImageNet dataset and SSL. For SSL, an encoder is first
trained within an SSL model. Transfer learning is then applied
to embed the pretrained encoder into the segmentation
architecture: the encoder is extracted
from the SSL model and a decoder is appended afterwards.

The random initialization often is the default in popular frameworks 
in contrast to setting the parameters to some constant value. In our 
case, the parameters are initialized according to the normal distribution
parameterization described in [5], with $\mathcal{N}(0, \sqrt{2/n_l})$, where $n_l$ is derived
from the number of input features as well as the filter size and $l$
is the layer index.
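
A minimal sketch of such a random initialization is given below. It assumes that the parameterization meant here is the zero-mean normal distribution with standard deviation sqrt(2 / n) proposed by He et al., with n = filter size squared times the number of input features; this reading of the formula, the function name and the flat weight list are assumptions made for illustration.

```python
import math
import random


def he_normal_weights(in_channels, kernel_size, out_channels, rng=random):
    """Sample conv weights from N(0, sqrt(2 / n)), n = kernel_size^2 * in_channels."""
    n = kernel_size * kernel_size * in_channels
    std = math.sqrt(2.0 / n)  # assumed standard deviation of the distribution
    count = out_channels * n
    return [rng.gauss(0.0, std) for _ in range(count)], std


weights, std = he_normal_weights(64, 3, 64, random.Random(0))
mean = sum(weights) / len(weights)
empirical_std = math.sqrt(sum((w - mean) ** 2 for w in weights) / len(weights))
print(f"target std: {std:.4f}, empirical std: {empirical_std:.4f}")
```

Deep learning frameworks provide such initializers directly, so this code only illustrates how the standard deviation depends on the layer dimensions.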

ImageNet pretraining is popular because of the transfer learning aspect.
Mainly the features of the first layers are of interest, because
low-level features like edges, point-like shapes or corners
are already learned there. This can help to achieve faster convergence, or
even convergence at all. Of course, pretraining on other datasets is also
possible, especially if they are semantically related to the follow-up 
training domain. 

The field of self-supervised learning is closely connected to the DL
field and is subject to highly active ongoing research. An SSL model is trained
exclusively on unlabeled data. In our case, a contrastive SSL method is
applied: SimCLR [21]. This method takes a sample out of the dataset
as a positive sample, and another disjoint sample as the negative sample.
For higher distinctiveness, data augmentation strategies are also
applied to the positive sample. A contrastive loss is then calculated
from both samples for backpropagation. The purpose of SSL methods 
is to learn feature representations without knowledge about semantics. 
This is called the pretext task. From there, the underlying model can 
be extracted and added to a so-called downstream task. This proce- 
dure can be viewed as a transfer learning. The downstream task in our 
case is the semantic segmentation, where the SSL pretrained encoder is 
embedded into. 
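The contrastive objective can be sketched as follows. In the SimCLR formulation, two augmented views of the same image form a positive pair and the views of all other images in the batch act as negatives; the loss is the normalized temperature-scaled cross entropy (NT-Xent). The code is a pure-Python illustration on toy 2D embeddings. The function names, the temperature of 0.5 and the example vectors are assumptions and are not taken from the experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """Mean NT-Xent loss; views_a[i] and views_b[i] are two views of image i."""
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        # Similarities to every other embedding (positive and negatives).
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        total += math.log(sum(math.exp(x) for x in logits)) - pos_logit
    return total / (2 * n)


anchors = [[1.0, 0.0], [0.0, 1.0]]
matching_views = [[0.9, 0.1], [0.1, 0.9]]    # views agree with their anchors
mismatched_views = [[0.1, 0.9], [0.9, 0.1]]  # views agree with the wrong anchor

loss_matching = nt_xent(anchors, matching_views)
loss_mismatched = nt_xent(anchors, mismatched_views)
```

The loss is lower when the two views of each image are embedded close together, which is the behavior the pretext task rewards.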


3.4 Random forest classification 


An RF classifier is applied in the RGB feature space as a reference for evaluation. In most cases, DL methods outperform pre-DL learning techniques on benchmark datasets. In our use case with a comparatively small amount of data, however, such pre-DL methods might still outperform DL models.


4 Experiments 


This section describes our experimental setup. Our goal is to detect corrosion as segmented image regions. The other classes are rejection classes and therefore not of further interest. We have four classes in total: lacquer, dirt, spots and corrosion. In our investigations we found that such an over-classification leads to better separability between corrosion and non-corrosion.

As mentioned in Section 1, we use the image of the unwrapped barrel coat, $I_\mathrm{train}$, as our training dataset. It is an RGB image of size 3072 x 8763 px. For image sizes we use the notation height x width throughout this work. The models are trained with smaller image patches that are cut from $I_\mathrm{train}$ without overlap to neighboring patches. These image patches are of size 512 x 512 px. With this size, we want to preserve as much information about the surface texture as possible by keeping spatial coherence. Splitting $I_\mathrm{train}$ into patches results in 102 image patches as the baseline training dataset, which is used for training in the setting without data augmentation. By applying data augmentation, the dataset grows to 8160 image patches.
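The patch count can be checked with a short calculation. The sketch assumes that incomplete patches at the image border are discarded, which is consistent with the reported number of 102 patches; the function names are illustrative.

```python
def patch_grid(height, width, patch):
    """Number of whole, non-overlapping patches along each axis."""
    return height // patch, width // patch


def patch_origins(height, width, patch):
    """Top-left (row, column) corner of every whole patch."""
    rows, cols = patch_grid(height, width, patch)
    return [(r * patch, c * patch) for r in range(rows) for c in range(cols)]


rows, cols = patch_grid(3072, 8763, 512)     # 6 rows, 17 columns
origins = patch_origins(3072, 8763, 512)     # 6 * 17 = 102 patches
unused_columns = 8763 - cols * 512           # pixel columns left over at the edge
augmentation_factor = 8160 // len(origins)   # augmented patches per baseline patch
```

Under this assumption, the height divides evenly into 6 rows, the width yields 17 whole columns with 59 pixel columns left over, and the augmented dataset corresponds to a factor of 80 per baseline patch. How this factor is composed is not stated in the text.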

For the first variant, the encoder is not pretrained but initialized randomly. We keep the default procedure of PyTorch, which distributes the model parameters according to the so-called Kaiming initialization [5]. For ImageNet-based encoder pretraining, we download the pretrained model parameters from torchvision. The SSL pretraining is done with a batch size of 512 for both encoders. The dataset in both cases is the unlabeled baseline dataset. Each encoder is trained for 8000 epochs. It should be noted that both ResNet18 and ResNet50 are randomly initialized for the SSL pretraining.

The DL model training covers all combinations of training with and without data augmentation, of the ResNet18 and ResNet50 encoders, and of three pretraining settings for the encoder, namely randomly initialized, ImageNet-pretrained and SSL-pretrained. Finally, the number of these combinations is doubled by employing both a U-net and a DeepLabv3 decoder architecture. In total, 24 models are trained and evaluated.
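The 24 configurations can be enumerated directly. The sketch uses the naming scheme of Table 1 (encoder-decoder-augmentation-pretraining); the constant names are illustrative.

```python
from itertools import product

ENCODERS = ["rn18", "rn50"]            # ResNet18, ResNet50
DECODERS = ["u", "dl"]                 # U-net, DeepLabv3
AUGMENTATION = ["noaug", "aug"]        # without / with data augmentation
PRETRAINING = ["rand", "inet", "ssl"]  # random, ImageNet, SSL

model_names = ["-".join(parts)
               for parts in product(ENCODERS, DECODERS, AUGMENTATION, PRETRAINING)]
```

The product of 2 encoders, 2 decoders, 2 augmentation settings and 3 pretraining settings gives the 24 models.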

For random forest training, we used 100 trees with a maximum depth of 8. The dataset used for this training is the baseline dataset without augmentations. There are two reasons for this: the number of data points (pixels) is sufficient for pre-DL models to generalize well, and RFs are fairly robust against overfitting in general.

The evaluation uses the classification metrics precision, recall, F1-score and overall accuracy. These metrics show the performance of the pixel-wise classification for each of the methods and make them comparable. Another metric, which measures the global per-class overlap of correctly classified regions, is the Intersection over Union (IoU). Furthermore, the mean IoU (mIoU) is calculated by averaging all class-specific IoU values.
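The metrics can be written down compactly from the per-class counts of true positives, false positives and false negatives. The following sketch computes them for flat lists of pixel labels; the function names, the convention of returning 0 for an empty denominator and the toy labels are assumptions for illustration.

```python
def class_metrics(y_true, y_pred, cls):
    """Precision, recall, F1-score and IoU of one class from pixel labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


def overall_accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_iou(y_true, y_pred, classes):
    """Average of the class-specific IoU values."""
    return sum(class_metrics(y_true, y_pred, c)["iou"] for c in classes) / len(classes)


# Toy example with six pixels.
y_true = ["corrosion", "corrosion", "lacquer", "lacquer", "dirt", "lacquer"]
y_pred = ["corrosion", "lacquer", "lacquer", "lacquer", "corrosion", "lacquer"]

corrosion = class_metrics(y_true, y_pred, "corrosion")
oa = overall_accuracy(y_true, y_pred)
miou = mean_iou(y_true, y_pred, ["lacquer", "dirt", "corrosion"])
```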


5 Results 


This section presents the results of our experiments. Qualitative results in the form of visualizations are depicted in Figure 2(a) to 2(f). For the quantitative results, Table 1(a) shows the global metrics of all trained models. Table 1(b) shows the class-wise metrics of the best performing DL model and of the RF model.




Figure 2: $I_\mathrm{test}$ (a), ground truth (b), result of the RF classification (c). Ground truth and predictions are colored as follows: lacquer (yellow), dirt (bright gray), spots (dark gray), corrosion (red). Predictions of the DL models: worst performing model (rn50-dl-noaug-inet) (d), best performing model with all classes (rn50-u-noaug-rand) (e), best performing model with the two aggregated classes no corrosion (green) and corrosion (red) (f).


6 Discussion 


Regarding model complexity, ResNet18 and ResNet50 yield comparable results for the global metrics. For the detection of corrosion, however, ResNet50 usually shows better results. This holds true for the four best performing models concerning the F1-score of the corrosion class. This indicates that lower model complexity does not necessarily lead to better model generalization, as proposed in Section 3.

DeepLabv3 and U-net decoders seem to be on par regarding the global metrics as well as the corrosion detection. The highest F1-score for the corrosion class is achieved by a U-net model. Furthermore, DeepLabv3 seems to yield smoother results visually, whereas some U-net-based models tend to show slightly more scattered segmentation results.


Table 1: Global metrics of each model (a). The model names are encoded as follows: encoder-decoder-augmentation-pretraining. The metrics are mean F1-score (mF1), overall accuracy (OA) and mean Intersection over Union (mIoU). The marked model (*) is not the best performing regarding the global metrics, but the best performing for the corrosion class. Class-wise metrics (b) are shown for the best performing DL model regarding the F1-score of the corrosion class and for the RF model. In addition to F1-score (F1) and Intersection over Union (IoU), precision (P) and recall (R) are depicted.


(a)

Model                 OA    mF1   mIoU
rn18-u-noaug-rand     0.52  0.29  0.35
rn18-u-noaug-inet     0.65  0.37  0.48
rn18-u-noaug-ssl      0.75  0.41  0.60
rn18-dl-noaug-rand    0.81  0.30  0.68
rn18-dl-noaug-inet    0.39  0.20  0.24
rn18-dl-noaug-ssl     0.63  0.27  0.46
rn18-u-aug-rand       0.86  0.36  0.76
rn18-u-aug-inet       0.91  0.39  0.83
rn18-u-aug-ssl        0.82  0.29  0.70
rn18-dl-aug-rand      0.88  0.37  0.79
rn18-dl-aug-inet      0.89  0.39  0.80
rn18-dl-aug-ssl       0.88  0.43  0.78
rn50-u-noaug-rand *   0.10  0.21  0.05
rn50-u-noaug-inet     0.80  0.41  0.67
rn50-u-noaug-ssl      0.54  0.38  0.37
rn50-dl-noaug-rand    0.39  0.24  0.24
rn50-dl-noaug-inet    0.75  0.24  0.61
rn50-dl-noaug-ssl     0.63  0.27  0.46
rn50-u-aug-rand       0.86  0.36  0.76
rn50-u-aug-inet       0.91  0.39  0.83
rn50-u-aug-ssl        0.82  0.29  0.70
rn50-dl-aug-rand      0.92  0.42  0.85
rn50-dl-aug-inet      0.86  0.40  0.76
rn50-dl-aug-ssl       0.88  0.40  0.79
random forest         0.80  0.44  0.67


(b)

Model               Class         P     R     F1    IoU
rn50-u-noaug-rand   Lacquer       0.86  0.03  0.05  0.10
                    Dirt          0.05  0.99  0.09  0.04
                    Spots         0.00  0.00  0.00  0.00
                    Corrosion     0.83  0.63  0.71  0.21
random forest       Lacquer       0.95  0.83  0.86  0.82
                    Dirt          0.06  0.21  0.09  0.05
                    Spots         0.00  0.00  0.00  0.00
                    Corrosion     0.82  0.76  0.79  0.42
rn50-u-noaug-rand   No corrosion  0.98  0.99  0.99  0.90
                    Corrosion     0.83  0.63  0.71  0.21
random forest       No corrosion  0.99  0.99  0.99  0.96
                    Corrosion     0.82  0.76  0.79  0.42

A surprising insight is that data augmentation did not seem to have a positive effect for all models. Moreover, only among the three best models regarding the F1-score of the corrosion class could we observe that DeepLabv3 decoders benefit from data augmentation and tend to perform poorly without it, while the opposite tends to be the case for U-nets.

For the encoder pretraining, we could not observe that any of the pretraining strategies results in superior performance. This is of particular interest because random initialization is usually considered an inferior starting point for training. In our experiments, random initialization performs similarly to the other pretraining strategies. In the literature, usually thousands of unlabeled images are utilized for SSL. As can be seen in Table 1(a), no gain could be achieved with SSL pretraining. It can be assumed that the 102 image patches were too few for a substantial SSL pretraining.

The random forest classification yields the best results regarding the F1-score of the corrosion class. It needs to be considered, however, that the RF classifier does not take context into account in our experiments. This leads to less smooth results in some regions where the separability in RGB space is not very pronounced. Especially larger areas of corrosion are prone to false negatives in the form of scattered pixels assigned to other classes.


7 Conclusion 


With the strategies we applied to train a DL model to generalize from a small baseline dataset, we found that for the core class of corrosion, an RF classifier performs better within the RGB feature space than a DL-based model. The RGB feature space in our case is well separable: there is no surface texture with a radiometric signature similar to that of corrosion in $I_\mathrm{test}$. To incorporate context in the non-DL domain, a conditional random field could be advantageous. To enrich the feature space, textural features can be extracted and added for training.

For the DL domain we found that there is still large potential for improvement. While strategies like data augmentation have long been considered mandatory in such scenarios, we could not observe a significant advantage. We have only scratched the surface of what is possible, with mediocre results at this point. Other possibilities are to incorporate unlabeled datasets for semi-supervised learning or a large-scale self-supervised learning for better encoder pretraining. Few-shot semantic segmentation techniques can also be taken into account in the future, as there is fairly high research activity in this area.


Abstract This paper discusses intelligent quality assurance solutions for the automatic detection of different defect classes in industrial manufacturing processes by optimizing the image processing and pattern recognition chain based on deep learning. By way of example, intelligent quality assurance solutions are shown for plastic injection molding of microfluidic components in medical technology and of macro components in automotive manufacturing. The application of powerful deep learning algorithms with their inherently higher generalization and abstraction capability enables smart in-process solutions for the evaluation of manufacturing quality and also allows conclusions to be drawn about the manufacturing process itself. In this paper, the relevant aspects of solving various industrial quality assurance tasks using deep neural networks are examined in more detail.


Keywords Deep learning, convolutional neural network (CNN), artificial intelligence


1 Motivation and goals of the presented research


Quality assurance tasks in today's production environment generally share one feature, entirely independent of the production process itself: automated quality evaluation in the form of product analysis can only be implemented by transferring the experts' a priori knowledge to a machine system. In the case of optical signal acquisition and processing, this requires, in addition to an image processing system adapted to the problem, an intelligent algorithmic implementation of the image processing and pattern recognition chain as well as the compilation of the experts' a priori knowledge in the form of manually classified datasets for subsequent classifier training. This makes clear that methods of artificial intelligence (AI) are indispensable for solving today's quality assurance tasks in intelligent, resource-conserving industrial production. Since the existing defect classes in the production process are generally known in advance and the recognition performance must be at a high level, supervised machine learning methods are used for such quality assurance tasks. In the field of supervised machine learning there is a large number of AI methods, both conventional and deep learning methods. Deep learning in particular has gained considerable importance owing to promising results in many areas of science, industry and everyday life. The success of deep neural networks is directly linked to increased computing performance, in particular the growth in computing power of high-performance graphics cards, which makes the computation of neural networks of such capacity possible in the first place.

In principle, very good recognition rates can be achieved with deep learning networks if either pretrained neural networks are used that were pretrained on images from similar industrial recognition tasks, or if very large amounts of pre-classified training data can be provided (which is very costly and time-consuming). In this paper, innovative quality assurance solutions for the automatic detection of different defect classes in the industrial manufacturing process are investigated and presented for plastic injection molding of microfluidic components in medical technology and of macro components in automotive manufacturing (see Figure 1).


Figure 1: Examples of test parts: microfluidic component from medical technology (left) [1] and macro components from the automotive industry (right) in the form of a reflector component (top) and an LED housing component (bottom) [2].


For both industrial quality assurance tasks investigated, it is necessary to realize an adapted image processing and pattern recognition chain and to apply powerful AI algorithms with increased generalization and abstraction capabilities. The solutions to the investigated industrial recognition tasks have in common that innovative pretrained deep learning networks [3] in particular can deliver good results. In this paper, the steps required to solve automated quality assurance in microfluidic plastic injection molding and in macro plastic injection molding are compared, and the various aspects of adapted image acquisition and of classification routine design are examined in more detail.

For plastic injection molding of macro components in the automotive sector, an inspection system for the production-integrated inspection of complexly structured plastic components is presented. Robot-assisted handling of the test parts enables both fully automatic random sampling and 100 % inspection, with rejection of defective components, depending on the selected cycle time of the injection molding machine. Based on modern machine learning methods, dimensional accuracy is checked and the surface condition is examined for the smallest defects such as inclusions, blistering, local deformations or color deviations. By using adaptive inspection methods, the inspection procedures can be adapted to new component geometries, materials and color characteristics. The short inspection time enables real-time operation even at high throughput rates. This makes the method particularly suitable for use in the injection molding of macro components. The injection-molded components investigated are used in vehicles and are subject to high production volumes combined with short product life cycles, which calls for a fast, efficient, cost-effective and adaptable solution. Thanks to a specially developed robot-assisted image acquisition system, complex and transparent objects can also be inspected.

For plastic injection molding of microfluidic components, an innovative inspection system was developed for the production-integrated inspection of complexly structured microfluidic plastic components for lab-on-a-chip applications in diagnostics. This medical technology application of injection-molded microfluidic components requires high-precision, 100 % inspection of production. Optical quality control was automated by developing a QC prototype and an adapted AI. Here, too, machine learning algorithms are used to evaluate the image data. The overarching goal was to increase the quality and productivity of the entire value chain.


2 State of the art in quality assurance using image processing and AI in industrial production


Digital image processing plays a prominent role in the quality assurance of production processes. In addition to the classic applications, it has also paved the way for solutions in the field of Industry 4.0 [4], [5]. In surface inspection, manual random sample inspections are often carried out in practice; they are time-consuming and yield subjective, inspector-dependent results. Owing to ever faster production processes, increasingly near-real-time requirements in quality analysis and considerable advances in computing and analysis technology, manual surface analyses are no longer up to date, which is why a rapid shift to automated methods can be observed [6]. For different fields of application, various methods for surface analysis using image processing already exist, e.g. the detection of defects on sheet material, the inspection of painted surfaces and image analysis for defect detection on wafers [7]. For the analysis of surface defects on industrially manufactured surfaces, texture analysis methods are mostly used [8], [9]. To automate surface inspection tasks, machine learning methods are essential for classifying parts as OK (iO, "in Ordnung") or not OK (niO, "nicht in Ordnung"), or into individual defect classes. In machine learning, the algorithm recognizes patterns and regularities in the examples provided to it and applies them to practical inspections. During the learning process, the correlation between features and classes of the training objects is determined, which is then used to predict the classes of unknown objects [10]. In addition to the classic algorithms, so-called deep learning is currently experiencing a considerable upswing. The convolutional neural network (CNN), for example, is a deep artificial neural network that is particularly suitable for image processing tasks [11]. Whereas classic methods require relevant regions in images to be segmented and important features to be computed, these intermediate steps are not needed with CNNs, since they are performed automatically within the algorithm.

Deep learning algorithms have achieved remarkable results in recent years and surpass the ability of traditional methods to find correlations in high-dimensional datasets. Nevertheless, there are some disadvantages and limitations in applying these algorithms. Deep neural networks require an extremely large amount of data to develop good generalization ability and thus deliver good results [12]. Alternatively, pretrained CNNs can be used, which ideally have already been pretrained on large image datasets of industrial origin.


3 Image acquisition and dataset creation for the investigated industrial quality assurance tasks


Figure 2 shows the image acquisition setups developed for the two industrial applications and used for the investigations: on the left, the robot-assisted test stand in the plastic injection molding production line for inspecting macro components in the automotive sector, and on the right, the test stand for inspecting microfluidic chips for medical technology. Both systems are based on optical sensors but differ greatly in their technological implementation owing to very different requirements of the manufacturing process and of the components to be inspected.


Figure 2: Image acquisition setups for the inspection of macro components in automotive manufacturing (left) [2] and of microfluidic chips (right).


3.1 Image acquisition and dataset creation for the quality assurance of plastic injection-molded parts in automotive manufacturing


First, preliminary image acquisition tests were carried out with various camera systems in order to suit the material characteristics and reflection properties of the test parts and to obtain high-quality images. The final image acquisition setup consists of a 5-megapixel camera with incident and transmitted light as well as a housing to shield against extraneous light. Handling of the test parts was implemented fully automatically by means of an articulated-arm robot (see Figure 2, left). Communication with the injection molding machine was implemented via binary signals and a self-developed protocol to ensure high stability and safety. The lighting system was optimized to achieve high contrast when imaging small surface details and defects. Shadows, local reflections and local over- and underexposure were minimized by highly diffuse illumination similar to dome lighting.

For the investigation, the test parts were first defined in the project together with the participating partners, inspection criteria were worked out and a defect catalog was drawn up (defect classes defined). Two types of defect in particular were identified: first, incorrect machine parameters (in particular missing holding pressure, wrong nozzle temperature, defect in the injection mold) and second, an incorrect material composition (in particular admixture of unsuitable plastic granulate, color deviations). Because of their particular optical challenges, a black LED housing and a transparent reflector component were selected as test parts. As a result, several datasets of the test parts "LED housing" and "reflector" were obtained, on which the AI algorithms were trained so that they could later be integrated successfully into the demonstrator. The image dataset acquired with the robot-assisted image acquisition system consists of around 500 object images for both types of test part (red reflector and black LED housing). The sample parts had previously been collected both during regular production and in a targeted defect simulation, in which the process parameters and the material composition were changed so that deliberately defective parts could be produced under controlled conditions. For the training set (100 to 150 examples per class), representative sample images of the main injection molding defects were used: "nozzle too hot", "no holding pressure / holding pressure too low" and "wrong granulate composition" as well as defect-free "good parts" (iO parts) for the LED housing, and "good parts", "sink mark defect" and "point-like defects" for the reflector. The clearly visible differences between defect-free and defective parts make it possible to obtain a representative dataset for training the AI with a relatively small number of sample parts.


3.2 Image acquisition and dataset creation for the quality assurance of microfluidic plastic injection-molded parts for medical technology


The image acquisition setup developed for this purpose consists of a 12K line-scan camera sensor and a 12-megapixel area-scan camera sensor. The resulting image size is 25000 x 9000 pixels. The entire image acquisition system is housed in a protective enclosure that shields it from extraneous light (see Figure 2, right). This is of great importance, since even small changes in illumination could lead to a lower recognition rate. The microfluidic components are currently still inserted manually, while the subsequent image acquisition and evaluation are performed automatically. For each component to be inspected, 26 images are taken with both cameras. The classic base algorithm examines and tracks predefined locations on the microfluidic component, such as µm-sized fluidic channels and µm- to mm-sized cavities, and then evaluates the image. Where optical irregularities occur in the fluidic channel, image sections of 500 x 500 pixels containing the irregularities are cut out in a second, downstream step and passed to the pretrained AI for automatic inspection. The origin of these irregularities can range from uncritical air inclusions around a channel to a critical particle in the channel itself, which can impair the functionality of the entire component and must therefore be reliably detected as function-critical. The raw images, the preview images with markings and all collected results are finally stored in a NAS and SQL database.

The AI recognition routine was trained with image sections of 500 x 500 pixels that were taken from the microfluidic channel and pre-classified by a human expert. The acquired dataset consists of a total of 2264 image sections in four classes: "Kanal_sauber" (clean channel, 499 object sections), "Kanal_mit_Fließlinie" (channel with flow line, 500), "Kanal_Grat" (channel with burr, 425) and "Kanal_Partikel" (channel with particle, 840), where the superclass "Kanal_Partikel" in turn comprises the three particle classes "Partikel_unkritisch" (uncritical particle), "Partikel_kritisch" (critical particle) and "Partikel_unkritisch Fluse" (uncritical particle, lint). Differentiating between the particle classes is difficult because of the very different positions of the particles below, above or in the channel, which cannot be reliably determined in a 2D image without depth information.


4 Image processing and pattern recognition chain


Figure 3 shows the image processing and pattern recognition chain that underlies the solution of every recognition task by means of machine learning and forms the basis for the AI solutions presented.


[Figure: pipeline diagram. Recoverable labels: outlier analysis and elimination; segmentation; intrinsic feature calculation in the deep learning network; classifier; validation.]


Figure 3: Image processing and pattern recognition chain.


5 AI solution for the quality assurance of plastic injection-molded parts in the automotive industry


A CNN from the Halcon software library that is pretrained on industrial image data (enhanced CNN) was trained and optimized with the acquired dataset. Recognition rates between 95 and 100 % were achieved for the various classes in laboratory use, using reduced RGB images of 500 x 500 pixels.

The recognition performance achieved in the laboratory test by the best deep learning method (likewise a pretrained Halcon CNN) was a mean overall recognition rate of 96.22 % with a standard deviation of 1.92 % for the LED housing and a mean overall recognition rate of 97.63 % with a standard deviation of 1.70 % for the reflector.

The recognition performance achieved for the individual defect classes in the later robot deployment in the production line, i.e. in industrial use under real environmental conditions, was then between 90 and 100 %. The classification time per object targeted and achieved in industrial use (excluding hardware handling time) is less than 1 ms at a mean recognition performance of 90 to 100 %, depending on the individual class. As a result, with the final robot-assisted system and the CNN of the MVTec Halcon image processing library (version 18.11), pretrained on industrial images and used as the classifier, mean overall recognition rates well above 90 % were achieved, depending on the plastic injection molding application (type of component), the model parameter settings and the defect classes occurring.


6 AI solution for the quality assurance of microfluidic plastic injection-molded components in medical technology


The recognition routine used for this task operates on image sections of 500 x 500 pixels taken from the microfluidic channel. The aim of the recognition routine was the automated detection of defects in the microfluidic channel. Image sections were provided to the AI for the assessment of manufacturing defects. Since the base algorithm already performs channel tracking and image section generation, no further preprocessing steps were required for this recognition task. The expert-pre-classified image sections of the OK class (iO) and of the defect classes (not OK, niO) could be used directly by the deep learning network for computing the intrinsic features and for training.

A stratified three-fold cross-validation was used, i.e., 2/3 of the
objects (sub-images) were used for training and validation and 1/3
for testing. The three runs with iteratively swapped training and test
partitions, and the recognition rate averaged over the three runs,
yielded a statistically more reliable estimate of the prediction
accuracy. The four pretrained neural networks "compact", "enhanced",
"alexnet" and "resnet50" of the Halcon 20.05 software library [3] were
examined more closely for their suitability for the given recognition
problem. For the different CNN variants, different classifier
parameters adapted to the recognition task were used. For the test
results, the mean recognition rate [%] and the standard deviation [%]
were calculated. The average overall recognition rates achieved are
73.99 % for the pretrained CNN "alexnet", 92.92 % for "compact",
97.17 % for "resnet50" and 98.32 % for "enhanced".
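The evaluation protocol described above can be illustrated with a minimal sketch. It assumes a simple round-robin stratification and the sample standard deviation; the helper names (`stratified_folds`, `cross_validate`) and the majority-class stand-in for the CNN are hypothetical and are not part of Halcon.

```python
import random
import statistics
from collections import Counter, defaultdict


def stratified_folds(labels, k=3, seed=0):
    """Split sample indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validate(labels, evaluate, k=3):
    """Return mean and sample standard deviation of the per-fold rate in %."""
    rates = []
    for test_fold in stratified_folds(labels, k):
        held_out = set(test_fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        rates.append(evaluate(train, test_fold))
    return statistics.mean(rates), statistics.stdev(rates)


# Toy data: 30 OK ("iO") and 12 defective ("niO") sub-images.
labels = ["iO"] * 30 + ["niO"] * 12


def majority_baseline(train, test):
    """Hypothetical stand-in for the CNN: predicts the most frequent training class."""
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return 100.0 * sum(labels[i] == majority for i in test) / len(test)


folds = stratified_folds(labels)
mean_rate, std_rate = cross_validate(labels, majority_baseline)
print(f"recognition rate: {mean_rate:.2f} % +/- {std_rate:.2f} %")
```

Each fold serves once as the test partition while the remaining two thirds are used for training, which mirrors the iteratively swapped partitions described in the text.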

The achievable recognition performance for the individual defect
classes in later production use is still pending, since the final
demonstrator is currently still under construction.


7 Summary of Results and Conclusion


This contribution demonstrates the successful application of
pretrained deep learning networks for intelligent quality assurance
solutions for the automatic detection of various defect classes in the
industrial manufacturing process, using two different applications in
plastic injection molding as examples. The AI solutions presented show
a successful implementation of the image processing and pattern
recognition chain and an outstanding performance of pretrained deep
learning networks (CNNs). The higher generalization capability and the
higher abstraction capability of CNNs enable the realization of fully
automated quality assurance processes in industrial manufacturing. For
both applications, mean overall recognition rates greater than 95 %
were achieved when the AI used was optimized.


Abstract Seamless image stitching depends not only on the accurate
alignment of camera images, but also on the compensation of
illumination inconsistencies. Even if two images are aligned
perfectly, the seam is still visible if the images have a distinct
vignetting or different exposure. Image stitching is used to expand
the field of view, but a visible seam can lead to significant errors
in subsequent visual perception tasks. Therefore, we present a
straightforward and accurate method for vignetting and exposure
correction for stitched images. First, we estimate the camera response
function that maps irradiance to intensity. Then, the vignetting model
is determined and applied to the irradiance images. After that, the
exposure of the stitched images is corrected using the irradiance
values at the seam. Finally, the irradiance is converted back into
intensity using the camera response function. Our approach is
evaluated using data recorded by our experimental vehicle and the
public nuScenes dataset. We assess the performance of our method using
the IoU of the histograms as well as the mean absolute error of the
intensity values in the overlapping image regions. Furthermore, we
demonstrate the real-time capability of our approach.


Keywords Autonomous driving, panorama, image stitching, vi- 
gnetting, exposure, illumination 

(a) Sensor setup prototype for the UNICARagil project.
(b) Schematic top view of the UNICARagil sensor setup to visualize the
sensor coverage of the color cameras and the lidar. Labels: driving
direction, left-facing camera, front-facing camera, lidar.

Figure 1: The images from the two lower color cameras of the UNICARagil
sensor module are stitched to a 270° horizontal panoramic image to
improve object detection and other perception tasks.


1 Introduction


Autonomous vehicles heavily depend on camera sensors to perceive 
their surroundings. Object detection, visual localization and mapping 
are fundamental challenges in automated driving based on camera im- 
ages. Instead of performing perception tasks for each individual im- 
age, the images can be fused to a panorama beforehand [1]. Thus, the 
horizontal viewing angle can be significantly expanded using image 
stitching. This facilitates object detection, especially when an object is 
cut off at the image boundaries by a limited field of view. Image stitch- 
ing precisely aligns individual images based on image features or lidar 
measurements. However, the seam is still visible due to vignetting and 
different exposure times. On the one hand, as shown in [2], vignetting 
is caused by a radial falloff in irradiance at the image boundaries, while 
on the other hand, the cameras adjust the exposure time to the current 
lighting conditions. As a result, the seam between stitched images can 
lead to false features in object detection and other processing tasks. 

In this paper, we propose a straightforward method for compensat- 
ing vignetting and correcting exposure for multiple stitched images in 
a time-critical environment. This distinguishes our method from many 
approaches that aim to compensate for vignetting in individual images 
using more complex models [3-5]. Hence, we estimate the camera re- 
sponse function (CRF) and the vignetting model before runtime. After 
the images are stitched, the vignetting model is applied and the expo- 
sure is corrected. Our approach is tested on the sensor setup prototype 
built as part of the publicly funded project UNICARagil [6,7], as well 
as on the public nuScenes dataset [8]. The prototype of the sensor 
setup mounted on a vehicle of the Karlsruhe Institute of Technology 
is shown in Fig. 1(a). In this setup, the camera images from the front- 
facing camera and the left-facing camera are stitched together to create 
a panorama. The sensor coverage of the two cameras and the lidar, 
which allows better alignment of the images, is shown in Fig. 1(b). 


2 Related Work 


Since cameras are widely used, inexpensive sensors, many articles have 
already addressed vignetting and exposure correction. Goldman and 
Chen provide a good overview of the causes of vignetting and suggest
a general approach that models vignetting as a sixth-order even
polynomial [2]. In addition, the approaches of Zheng et al. [3,4] and
of Cho, Lee, and Lee [5] focus on vignetting correction for single
images. Furthermore, Kordecki, Palus, and Bal propose the use of a
non-radial vignetting model [9].


Diagram labels: Raw Intensity Images, Inverse CRF, Raw Irradiance
Images, Vignetting and Exposure Correction, Corrected Irradiance
Images, CRF, Corrected Intensity Images, Image Stitching, Seamless
Stitched Images. Comparison branch: Image Stitching without Vignetting
and Exposure Correction.

Figure 2: Workflow for vignetting compensation and exposure correction
in image stitching.


3 Vignetting and Exposure Correction 


Our approach consists of four individual steps. The workflow of our
approach is depicted in Fig. 2. First, the camera response functions
of all cameras are estimated. Then, the vignetting model is generated
and applied. Afterwards, the correction of the exposure between the
stitched camera images is performed. In the last step, the corrected
irradiance values are converted back into intensities. The qualitative
effect of our approach is shown in Fig. 3 for an exemplary pair of
stitched images.


Real-time multi-image vignetting and exposure correction


(c) Image stitching with vignetting and exposure correction.

Figure 3: Comparison of image stitching with and without vignetting
compensation and exposure correction. The images are recorded with the
UNICARagil sensor setup prototype in Fig. 1.


3.1 Calibration of the Camera Response Function 


Both the vignetting compensation and the exposure correction are
performed on the irradiance, which is calculated from the intensities
using the non-linear camera response function. Therefore, we determine
the camera response function of our sensor setup before the actual
runtime. The camera response functions are estimated from exposure
series of a static scene with known exposure times, as in [10]. For
each of the color cameras in Fig. 1, we obtain three response
functions, one per color channel. However, we found that the camera
response functions are approximately identical for all cameras and all
color channels. The qualitative evaluation of UNICARagil sensor data
shows decent results for vignetting and exposure correction using the
approximated camera response function. For this reason, we store only
one approximated camera response function for the entire panorama,
which is shown in normalized form in Fig. 4(a). After vignetting and
exposure correction in Sections 3.2 and 3.3, the camera response
function is used to convert the irradiance values back to intensity
values.
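How a precomputed response function can be applied at runtime may be sketched as a lookup table. The gamma curve below is only a hypothetical stand-in for the calibrated response function, and the names `crf`, `INVERSE_LUT`, `to_irradiance` and `to_intensity` are illustrative assumptions, not the implementation used on the vehicle.

```python
import numpy as np

# Hypothetical CRF: normalized irradiance in [0, 1] -> normalized intensity in [0, 1].
# A gamma curve stands in for the calibrated response; it is NOT the measured CRF.
GAMMA = 2.2


def crf(irradiance):
    """Map normalized irradiance to normalized intensity."""
    return np.clip(irradiance, 0.0, 1.0) ** (1.0 / GAMMA)


# Precompute the inverse CRF once, as a table over all 8-bit intensity levels.
levels = np.arange(256) / 255.0
INVERSE_LUT = levels ** GAMMA  # intensity level -> normalized irradiance


def to_irradiance(image_u8):
    """Convert an 8-bit intensity image to normalized irradiance via the table."""
    return INVERSE_LUT[image_u8]


def to_intensity(irradiance):
    """Convert normalized irradiance back to 8-bit intensities using the CRF."""
    return np.round(crf(irradiance) * 255.0).astype(np.uint8)


image = np.array([[0, 64], [128, 255]], dtype=np.uint8)
roundtrip = to_intensity(to_irradiance(image))
```

Because the table is built before runtime, the per-frame cost of the inverse mapping reduces to one indexing operation per pixel.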


3.2 Estimation of the Vignetting Model 


To compensate for vignetting, we found that in our case a sufficiently
accurate model can be created by approximating the vignetting with the
cosine-fourth-power law. This law estimates the radial irradiance
falloff towards the boundaries of the camera images. To get a better
result for the panoramic image, we use a spherical camera model in our
approach, which is described in more detail in [1]. As with the pinhole
camera model, the intrinsic parameters can be specified in the matrix A
as in Eq. 1, where f denotes the focal length and (u_0, v_0) describes
the principal point. Another advantage of the spherical camera model is
that the pixel coordinate is proportional to the angle of incidence.
This results in Eq. 2, which models the vignette as a function of the
distance r to the principal point and the focal length f. In addition,
the variables a and b are used to fit the vignetting model. The values
are determined empirically based on a sequence recorded with the
UNICARagil sensor module and result in our case in a = 3.4 and b = 0.1.
In Fig. 4(b) the normalized vignetting model in the spherical camera
frame is shown. Just like the camera response function, the vignetting
model is also created before runtime to achieve real-time capability.


(a) Approximated and normalized camera response function of the
UNICARagil sensor setup (axes: normalized irradiance, normalized
intensity).
(b) Normalized vignetting model in the spherical camera frame.

Figure 4: Both the camera response function (a) and the vignetting
model (b) are computed before the actual runtime, to be applied to the
camera images afterwards.


g(u, v) = a \cos^4(r / f) + b,   r = \sqrt{(u - u_0)^2 + (v - v_0)^2},   a, b \in \mathbb{R}^{+}   (2)
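Eq. 2 can be evaluated once on the pixel grid and reused for every frame. The following is a minimal sketch, assuming that the model is normalized to a maximum of one and applied by dividing the irradiance by it; the image size, focal length and principal point are hypothetical values, and only a = 3.4 and b = 0.1 are taken from the text.

```python
import numpy as np


def vignetting_model(width, height, f, u0, v0, a=3.4, b=0.1):
    """Evaluate g(u, v) = a * cos^4(r / f) + b on the pixel grid, scaled to max 1."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    r = np.sqrt((u - u0) ** 2 + (v - v0) ** 2)
    g = a * np.cos(r / f) ** 4 + b
    return g / g.max()


def compensate(irradiance, model):
    """Assumed application step: divide the irradiance by the normalized model."""
    return irradiance / model


# Hypothetical camera parameters, chosen for illustration only.
model = vignetting_model(width=640, height=480, f=500.0, u0=320.0, v0=240.0)

# A uniformly lit scene darkened by the model is restored by the compensation.
flat_scene = np.full((480, 640), 0.5)
observed = flat_scene * model
restored = compensate(observed, model)
```

Since the model depends only on the camera geometry and the fitted parameters, it can be stored as an image-sized array and applied with a single elementwise operation per frame.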


3.3 Exposure Correction 


After the vignetting model is applied, the brightness between the
stitched images is not fully adjusted due to a difference in exposure,
as can be seen in Fig. 3(b). To perform exposure correction, an
individual exposure compensation factor c is determined for each
image. For this purpose, we calculate the quotient c_{rel} from the
cumulative irradiances E along the vertical seam in the overlapping
region for a pair of stitched images in Eq. 3. Based on the quotient
c_{rel}, we derive in the following step the individual exposure
correction factors for the single images in Eq. 4, using an additional
constraint that ensures that the average of the factors is equal to
one. If a panorama consists of more than one seam n, several quotients
c_{rel,n} are obtained, from which the individual exposure factors can
be calculated. For multiple stitched images with vertical seams n, we
use the transitional condition c_{right,n} = c_{left,n+1}. In
Fig. 3(c) the stitching result after exposure correction is shown.


(3) 


Cleft = Crop +1’ Cright = (4) 
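The two-step computation for a single seam can be sketched as follows. Since the bodies of Eq. 3 and Eq. 4 are not fully legible here, the sketch assumes that c_rel is the cumulative seam irradiance of the left image divided by that of the right image, and that the factors satisfy c_right / c_left = c_rel together with the stated constraint that their average equals one. The function name `exposure_factors` is hypothetical.

```python
import numpy as np


def exposure_factors(seam_left, seam_right):
    """Return (c_left, c_right) for one seam from the irradiance values at the seam."""
    # Assumed reading of Eq. 3: ratio of cumulative irradiances, left over right.
    c_rel = np.sum(seam_left) / np.sum(seam_right)
    # Assumed reading of Eq. 4: c_right / c_left = c_rel and (c_left + c_right) / 2 = 1.
    c_left = 2.0 / (c_rel + 1.0)
    c_right = 2.0 * c_rel / (c_rel + 1.0)
    return c_left, c_right


seam_left = np.array([0.40, 0.50, 0.60])   # brighter image at the seam
seam_right = np.array([0.20, 0.25, 0.30])  # darker image at the seam
c_left, c_right = exposure_factors(seam_left, seam_right)

# After scaling, both images have the same cumulative irradiance at the seam.
corrected_left = c_left * seam_left.sum()
corrected_right = c_right * seam_right.sum()
```

Under these assumptions the brighter image is attenuated and the darker one is amplified by complementary amounts, so the overall brightness of the panorama is preserved on average.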


4 Experimental Results and Evaluation 


We evaluate the presented approach on the UNICARagil sensor setup 
shown in Fig. 1 as well as on the sequences 1 to 10 of the nuScenes 
dataset [8]. In the latter case, we use the images from the front-facing 
camera and the front-left-facing camera to create a panorama. Since 
we do not know what kind of cameras are used in the nuScenes setup 
and we cannot reconstruct the camera response function from the avail- 
able data, we assume a linear response function as an approximation. 
Besides qualitative results in Fig. 3, we show the performance of our 
method using two different metrics in 4.1 and 4.2. First, the intersection 
over union of the histograms in the overlapping region of the stitched 
images is used. The second metric used is the mean absolute error of 
intensity differences in the overlapping region. Furthermore, we ana- 
lyze in 4.3 the runtime of the vignetting compensation and exposure 
correction for stitched images and show its real-time capability. 


4.1 Intersection over Union of Histograms


To measure the accuracy of image stitching, we compare the histograms
of the two overlapping regions of the single images. This is done
before and after vignetting and exposure correction to show the
improvement due to our approach. For a better comparison, the images
are converted to 8-bit grayscale so that the histogram values are
between 0 and 255 with a bin size of 1. The similarity of two
histograms H_i can be measured by calculating the intersection over
union (IoU). The IoU between two histograms is calculated according to
Equation 5. To exclude extremely over- or underexposed parts of the
resulting panorama, only pixel intensities other than 0 and 255 are
considered for evaluation. Table 1 shows the average IoU values of the
histograms before and after vignetting and exposure correction on the
recording with our UNICARagil sensor setup and on the nuScenes dataset.
The increase in IoU with our approach shows that the histograms of the
two overlapping regions are better aligned than without it.


(5)
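The histogram comparison can be sketched as follows, assuming the common definition of the histogram IoU as the sum of bin-wise minima divided by the sum of bin-wise maxima; Equation 5 may differ in detail. The function name `histogram_iou` is hypothetical.

```python
import numpy as np


def histogram_iou(region_a, region_b):
    """IoU of the 8-bit grayscale histograms of two overlapping regions."""
    # Bin size 1; the saturated values 0 and 255 are excluded from the evaluation.
    valid_bins = np.arange(1, 255)
    hist_a = np.bincount(region_a.ravel(), minlength=256)[valid_bins]
    hist_b = np.bincount(region_b.ravel(), minlength=256)[valid_bins]
    intersection = np.minimum(hist_a, hist_b).sum()
    union = np.maximum(hist_a, hist_b).sum()
    return intersection / union if union > 0 else 0.0


region_a = np.array([[10, 10], [20, 0]], dtype=np.uint8)
region_b = np.array([[10, 20], [20, 255]], dtype=np.uint8)

iou_different = histogram_iou(region_a, region_b)
iou_identical = histogram_iou(region_a, region_a)
```

In the toy example the two regions share one pixel count in each of the bins 10 and 20, so the intersection is 2 and the union is 4.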


Table 1: Comparison of the average IoU of the histograms from the
overlapping areas of the stitched images. The stitching quality with
raw images is compared to that with images processed by our approach
for vignetting and exposure correction, for the two different image
sequences.


              Raw images    Processed images
UNICARagil    46.29 %       55.94 %
nuScenes      38.43 %       46.45 %


4.2 Mean Absolute Error 


Since identical histograms can be derived from different images, we
additionally evaluate the local similarity between pixel intensities
with the mean absolute error. In contrast to Zheng et al. [3], we do
not measure the difference to a ground-truth vignetting function.
Instead, we again use the overlapping regions of the single images and
calculate the mean absolute error as another similarity measure to
evaluate the image stitching performance. To this end, we convert the
images to 8-bit grayscale and calculate the mean of the pixelwise
absolute differences, as shown in Equation 6. As in Section 4.1, we use
only pixel pairs with values other than 0 and 255 for evaluation. The
mean absolute error is obtained by dividing by the number of pixel
pairs and is then averaged over the number of samples in the sequences.
The results before and after vignetting and exposure correction are
listed in Table 2 for the sequence recorded with the UNICARagil sensor
setup as well as for the nuScenes dataset. The evaluation clearly shows
that the mean absolute error decreases when our approach is applied to
the images.


MAE = \frac{\sum_{u}^{\mathrm{width}} \sum_{v}^{\mathrm{height}} \left| I_{\mathrm{left}}(u, v) - I_{\mathrm{right}}(u, v) \right|}{u \cdot v}   (6)
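A minimal sketch of this metric is given below. It assumes that a pixel pair is discarded as soon as either of its two values is 0 or 255, and that the sum is divided by the number of remaining pairs; the function name `masked_mae` is hypothetical.

```python
import numpy as np


def masked_mae(left, right):
    """Mean absolute error over pixel pairs whose values are neither 0 nor 255."""
    # Widen the type first so that the difference of two uint8 values cannot wrap.
    left = left.astype(np.int16)
    right = right.astype(np.int16)
    valid = (left > 0) & (left < 255) & (right > 0) & (right < 255)
    if not valid.any():
        return 0.0
    return float(np.abs(left[valid] - right[valid]).mean())


left = np.array([[100, 0], [50, 200]], dtype=np.uint8)
right = np.array([[110, 30], [40, 255]], dtype=np.uint8)
mae = masked_mae(left, right)
```

In the toy example two of the four pairs contain a saturated value and are skipped; the remaining pairs each differ by 10 gray levels.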


Table 2: Comparison of the average mean absolute error of the pixel
intensities from the overlapping areas of the stitched images. As in
Table 1, raw images are compared to images processed with our
approach.


              Raw images    Processed images
UNICARagil    21.58         7.71
nuScenes      37.56         10.99


4.3 Runtime Analysis 


In addition to the metrics, which show an improvement in accuracy, we
evaluated the real-time capability of our approach. To optimize our
approach in terms of its execution time, we run the processing
operations directly on the graphics card. This is worthwhile once the
entire image processing in an autonomous vehicle is performed on the
graphics card, since copying data to and from the graphics card takes
a lot of time. This offers further advantages, for example, for object
detection with machine learning. The improved runtime is a
distinguishing feature of our simplified approach to vignetting and
exposure correction compared to the approaches in [3,5]. In Table 3, we
compare the average runtimes of vignetting compensation and exposure
correction on CPU and GPU. The evaluation computer runs Ubuntu 18.04.6
LTS as its operating system. It is equipped with an Intel® Xeon®
Processor E5-2640 v3 running at 2.6 GHz, 64 GB of RAM, and an NVIDIA
GeForce RTX 2080 Ti as GPU. The table clearly shows that we achieve
real-time capability at a frame rate of 10 Hz with an average
processing time of 31.36 ms by using the GPU. Further improvements can
be expected on the latest generation of NVIDIA graphics cards.


Table 3: Comparison of the average runtimes of our approach for
vignetting and exposure correction on CPU and GPU.


                 CPU       GPU
Runtime in ms    155.35    31.36
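The real-time claim can be checked with a short calculation on the values of Table 3. The sketch assumes that real-time capability means that the correction finishes within one frame period at 10 Hz; it does not account for the other processing steps that share this budget.

```python
FRAME_RATE_HZ = 10.0
RUNTIME_MS = {"CPU": 155.35, "GPU": 31.36}  # average runtimes from Table 3

# One frame period at 10 Hz is the time available per panorama.
budget_ms = 1000.0 / FRAME_RATE_HZ

# A device is considered real-time capable if its runtime fits into the period.
real_time = {device: runtime <= budget_ms for device, runtime in RUNTIME_MS.items()}

# Speed-up of the GPU implementation over the CPU implementation.
speedup = RUNTIME_MS["CPU"] / RUNTIME_MS["GPU"]
```

With these numbers the GPU runtime uses roughly a third of the frame period, while the CPU runtime exceeds it.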


5 Conclusion 


In this paper, we presented a straightforward and effective method for
vignetting and exposure correction of multiple camera images in
image stitching. Our approach relies on a known camera response
function and a previously estimated vignetting model that are applied 
on the images to be stitched. First, the irradiance is calculated from in- 
tensity using the inverse camera response function and our vignetting 
model is applied. Then, the optimal exposure correction factors for 
the single images are estimated from the pixels at the seam to improve 
the quality of the panorama. After vignetting and exposure correc- 
tion, the intensities are obtained from the modified irradiance values. 
In summary, the vignetting of the single images is compensated and 
the transition at the seam of the panorama due to different exposure is 
corrected. We evaluated our approach by calculating the IoU between 
the histograms of the overlapping regions of the stitched images before 
and after vignetting and exposure correction and have clearly demon- 
strated that the IoU increases significantly after applying our approach. 
In addition, we have shown that the mean absolute error of the over- 
lapping regions after vignetting and exposure correction also decreases 
strongly. Both quantitative results confirm the significant improvement 
in image stitching quality after using our approach. This can lead to 
higher precision in object detection and other perception tasks.
Finally, we have also shown that our approach can be executed on a
graphics card in real-time. To further extend our approach, we plan to
integrate joint optimization of exposure correction factors for
multiple seams of a full 360° horizontal panoramic image in the future.


C. Kinzig, G. Feng, M. Granero, and C. Stiller


Abstract Sensor-based sorting provides state-of-the-art solutions for
sorting of granular materials. Current systems use line-scanning
sensors, which yield only a single observation of each object and no
information about its movement. Recent works show that using an
area-scan camera bears the potential to decrease the error in both
characterization and separation. In combination with a multiobject
tracking system, this enables an estimate of the followed paths as
well as the parametrization of an individual motion model per object.
While previous works focus on physically-motivated motion models, it
has been shown that state-of-the-art machine learning methods achieve
an increased prediction accuracy. In this paper, we present the
development of a neural network-based multiobject tracking system and
its integration into a laboratory-scale sorting system. Preliminary
results show that the novel system achieves results comparable to a
highly optimized Kalman filter-based one. A benefit lies in avoiding
tedious manual tuning of the parameters of the motion model, as the
data-driven nature of the novel approach allows its parameters to be
learned from provided examples.


Keywords Sensor-based sorting, machine learning, visual in- 
spection, multiobject tracking 


1 Introduction


Sensor-based sorting provides state-of-the-art solutions for sorting of 
granular materials. This umbrella term describes a family of systems 
that enable the physical separation of individual objects from a material 
stream on the basis of information acquired by one or multiple sensors. 
Among other fields of application, it is considered a key technology 
for achieving a circular economy. In distinction to mechanical sorting 
processes such as screening, wind sifting, or float/sink processes, the 
technology is sometimes also referred to as indirect sorting [1], since 
particle classification and separation are performed in separate steps. 
In theory, any number of classes can be recognized for sorting, and 
separation into multiple fractions is also possible in principle. In in- 
dustrial applications, however, the task is preferably implemented as 
a binary sorting task, i.e., sorting into “product” and “residue”, since 
multi-way sorting requires complex mechanical handling. 

The functional principle can be summarized as follows. First, the ma- 
terial is fed into the system by means of a conveyor mechanism. Subse- 
quently, the material is transported further via a transport medium. In 
the course of the transport, sensor-based data acquisition takes place. 
The data collected is evaluated with the goal to detect and classify 
individual particles in the material stream. The result of the classifica- 
tion is the basis for the sorting decision, which is executed by means 
of an actuator. A particular strength of the sorting technology lies in 
the variety of industrially available sensors that are suitable for use in 
sensor-based sorting systems. This results in great flexibility with re- 
gard to the detectable material properties and thus the sorting criteria 
to be applied. Due to their suitability for systems with high material 
throughputs, imaging sensors dominate at this point. 


1.1 Motivation 


Current systems use line-scanning sensors, which is convenient as the 
material is perceived during transportation. In case sorting criteria 
based on color, shape or texture suffice, line-scan cameras in the visible 
spectrum are used. However, this yields a single observation of each 
object only and no information about their movement. Due to a delay 
between localization and separation, assumptions regarding the
velocity need to be made in order to calculate the location and point
in time for separation [2,3]. Hence, it is necessary to ensure that all
objects are transported at uniform velocities. This is often a complex
task.

Recent works show that using an area-scan camera instead of a line- 
scanning one bears the potential to decrease both the error in charac- 
terization [4] and separation [5] in sensor-based sorting. Using a suf- 
ficiently high frame-rate, individual objects are observed at multiple 
time points. By employing a multiobject tracking system, this enables 
an estimate of the followed paths as well as the parametrization of an 
individual motion model per object. The latter allows for accurate pre- 
dictions regarding which actuators need to be activated at what point 
in time such that an object is deflected and hence removed from the ma- 
terial stream. Therefore, the approach is also referred to as predictive 
tracking. Eventually, this results in an increased sorting quality. 

While previous works focus on physically-motivated motion mod- 
els, it is shown in [6] that state-of-the-art machine learning methods 
provide a powerful tool for achieving an increased prediction accuracy, 
particularly in complex sorting scenarios. However, the approach has 
not been evaluated in real sorting experiments yet, but rather using 
pre-recorded image data and a simulated separation. 


1.2 Contribution 


In this paper, we present the development of a neural network-based 
multiobject tracking system and its integration into a laboratory-scale 
sorting system with an area-scan camera. This is the first time that the 
complete development cycle required to make such machine learning- 
based methods applicable in an industrial sorting setting is considered. 
With respect to the data processing model itself, we consider the multi- 
layer perceptron from [6]. This model takes observation coordinates of 
individual objects, which in our case are determined by means of real- 
time image processing, as an input and generates the predictions for 
future time points, in our case for the separation stage, as an output. 
Eventually, actual sorting experiments using the neural network-based 
multiobject tracking system are conducted. 


2 Materials and Methods


In the following, we provide details on the experimental setup, e. g., 
the exemplary sorting scenario and the considered sorting system, the 
different prediction models that are compared experimentally as well 
as the implementation of the real-time inference engine. 


2.1 Experimental Setup 


We choose an exemplary sorting scenario from the field of construc- 
tion waste recycling. By generating pure fractions from construction 
and demolition waste, the material is prepared for the production of 
recycled construction materials [7]. In our scenario, we consider an in- 
put stream consisting of sand-lime brick and brick, see Figure 1. The 
task is to remove brick from the waste stream. The material is crushed 
to a grain size of 4 to 6 mm prior to sorting. 


(a) Sand-lime brick   (b) Brick   (c) Mixed material

Figure 1: Photos of the materials used for the exemplary sorting task.


Both for the acquisition of training data and for the experimental
validation, we use the lab-scale sorting system shown in Figure 2. A
detailed description of the system is provided in [5]. A vibrating
feeder is used to feed the material into the system. For
transportation, a conveyor belt with a width of 140 mm is used. At the
end of the belt, right before discharge, the material stream is
recorded using an area-scan camera in combination with a ring light.
After discharge and during a flight phase, separation is performed
using fast switching pneumatic valves.


Figure 2: Photo of the lab-scale sorting system used in this study. 


The acquired image data is processed with the aim of localizing and 
classifying individual particles. Based on the classification, a sorting 
decision is calculated. In case a particle is to be removed from the ma- 
terial stream, a control signal is calculated and transmitted describing 
the time as well as the valves to be activated in the array. Exactly this 
calculation, referred to as the prediction model in the following, is the 
subject of the present study. 


2.2 Prediction Models 


We validate the proposed approach comparatively. Hence, we also con- 
sider two established prediction models for the calculation of the con- 
trol signals for separation. 

First, as a baseline, we consider the system to be equipped with a
line-scan camera instead of an area-scan one. This corresponds to a
setup as used in the industry at the time of writing. In this case, no
information regarding the particles' motion is known. Consequently, a
uniform transport velocity has to be assumed. A fixed, typically
experimentally determined delay is added to the point in time of
observation of a particle in order to calculate the temporal component
of the prediction. Furthermore, it is assumed that no velocity
perpendicular to the transport direction exists. Hence, the valves to
be activated correspond to the lateral position of the particle as
seen by the camera.
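The baseline prediction can be sketched in a few lines. The delay, the valve pitch and the number of valves below are hypothetical values chosen for illustration (only the belt width of 140 mm is taken from the text), and the function name `baseline_prediction` is likewise an assumption.

```python
def baseline_prediction(observation_time_s, lateral_position_mm,
                        delay_s, valve_pitch_mm, num_valves):
    """Line-scan baseline: fixed delay in time, unchanged lateral position."""
    # Temporal component: observation time plus a fixed, experimentally tuned delay.
    activation_time_s = observation_time_s + delay_s
    # Spatial component: the valve directly in line with the observed position.
    valve = int(lateral_position_mm // valve_pitch_mm)
    valve = min(max(valve, 0), num_valves - 1)
    return activation_time_s, valve


# Hypothetical configuration: 28 valves with 5 mm pitch across the 140 mm belt.
activation_time_s, valve = baseline_prediction(
    observation_time_s=1.200,
    lateral_position_mm=62.0,
    delay_s=0.050,
    valve_pitch_mm=5.0,
    num_valves=28,
)
```

Any particle that moves faster, slower or sideways relative to these assumptions is missed by the valve, which is the limitation the tracking-based models address.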

Second, we consider the approach originally proposed in [8] and 
experimentally validated in [5]. By using a high-speed area-scan cam- 
era, particles contained in the material stream are observed at multiple 
points in time and tracked via a multiobject tracking system. This way, 
motion parameters, e. g., the velocity in and perpendicular to transport 
direction, can be determined individually for each particle. In com- 
bination with a motion model, these parameters are used to precisely 
estimate the control signal for separation. The approach focuses on 
applying Kalman filters on the centroid of the particles for predictive 
tracking. In this course, linear, physically motivated models, such as 
constant velocity (CV), are used. 
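The prediction step under a constant velocity assumption can be sketched as a plain extrapolation. This is not a Kalman filter: the position and velocity are taken as given, as if they were the filter's current state estimate. The coordinate convention (y along the transport direction, x lateral) and the function name `cv_predict` are assumptions.

```python
def cv_predict(position_mm, velocity_mm_s, t_now_s, bar_y_mm):
    """Extrapolate a particle with constant velocity to the separation bar."""
    x, y = position_mm
    vx, vy = velocity_mm_s
    if vy <= 0.0:
        raise ValueError("particle does not move towards the separation bar")
    # Time needed to cover the remaining distance in the transport direction.
    dt = (bar_y_mm - y) / vy
    arrival_time_s = t_now_s + dt
    # The lateral drift accumulated during that time shifts the arrival location.
    arrival_x_mm = x + vx * dt
    return arrival_time_s, arrival_x_mm


# Hypothetical state: 1 m/s along the belt with a small lateral drift.
arrival_time_s, arrival_x_mm = cv_predict(
    position_mm=(70.0, 100.0),
    velocity_mm_s=(20.0, 1000.0),
    t_now_s=0.5,
    bar_y_mm=250.0,
)
```

Unlike the baseline, both the arrival time and the valve position adapt to the individually estimated velocity of each particle.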

The novel data-driven approach experimentally validated in this pa- 
per takes the last five captured position measurements of each particle 
as input and directly outputs the control signal for separation, i.e., the 
estimated arrival time and location of the particle at the separation bar. 
This is opposed to the original predictive tracking algorithm, which for 
this purpose uses the estimated positions and velocities from the un- 
derlying Kalman filter. The input measurements are provided by the 
exact same multiobject tracking system employed in the original pre- 
dictive tracking setup. The approach uses a multilayer perceptron with 
four hidden layers as a predictor, where each hidden layer consists of 
16 neurons. Further details on the architecture and training procedure 
are given in [6]. 
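The shape of such a predictor can be sketched with a plain NumPy forward pass. The sketch assumes two-dimensional position measurements (hence ten inputs), two outputs (arrival time and location), ReLU activations and random, untrained weights; the actual activation function, input encoding and training are described in [6] and may differ.

```python
import numpy as np

# 5 positions x 2 coordinates in, four hidden layers of 16 neurons, 2 values out.
LAYER_SIZES = [10, 16, 16, 16, 16, 2]


def init_params(rng):
    """Create random (untrained) weights and zero biases for every layer."""
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
    ]


def forward(params, track):
    """Map the last five (x, y) measurements to (arrival time, arrival location)."""
    x = np.asarray(track, dtype=float).reshape(-1)
    for weights, bias in params[:-1]:
        x = np.maximum(x @ weights + bias, 0.0)  # assumed ReLU activation
    weights, bias = params[-1]
    return x @ weights + bias  # linear output layer


params = init_params(np.random.default_rng(0))
track = [(70.0, 10.0), (70.2, 20.0), (70.4, 30.0), (70.6, 40.0), (70.8, 50.0)]
prediction = forward(params, track)
num_parameters = sum(w.size + b.size for w, b in params)
```

With these layer sizes the network has only about a thousand parameters, which is one reason why inference under real-time conditions is feasible.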

While numerous tools and software frameworks are now established for
model development, the use of neural networks in production systems
and, in the present case, under real-time conditions still represents
a very current research topic. In the course of this study, various
frameworks for integration into the sorting system were investigated
in a first step. A technical constraint was the use of the programming
language C++. After an initial survey, the frameworks TensorRT from
NVIDIA and OpenVINO from Intel were chosen. These frameworks differ
fundamentally in the target hardware on which the inference is
executed. TensorRT allows the execution of the inference on dedicated
NVIDIA graphics cards, OpenVINO on Intel CPUs as well as integrated
Intel GPUs. In both cases, conversion of the model was necessary prior
to any potential application. ONNX was identified as the current,
supposedly universal format for this purpose.

In addition to training the developed model on the basis of the gen- 
erated image sequences, it was also necessary to take knowledge about 
the system structure into account in the implementation, see Figure 3. 
Here, parameters relating to the separation, such as the distance be- 
tween the camera observation area and the separation bar, were pri- 
marily decisive. To compensate for errors potentially arising due to 
measurement inaccuracies, parameters for manual configuration of an 
offset, e. g., with regard to the distance, were implemented. 


[Figure 3 diagram. Boxes: acquisition of image sequences on the sorting
system; image processing to generate the data basis; model development,
training and validation; knowledge about system design.]

Figure 3: Schematic illustration of the development process of the
machine-learning-based multiobject tracking.


3 Experimental Validation 


We conduct sorting experiments using the methods and materials de- 
scribed in Section 2. One experiment corresponds to sorting 200 g of 
the material in a batch manner. In addition to the three prediction
models described in Section 2.2, three different mixing ratios are
investigated. More precisely, we consider ratios of residue, i.e.,
brick, of 10 %, 25 % and 50 %. Furthermore, we conduct experiments
with a mass flow of 10 g/s and 20 g/s.


3.1 Model Training 


The multilayer perceptron was trained on a data set of particle tracks 
recorded on the lab-scale sorting system described in Section 2, with 
tracks obtained by a preceding offline run of the multiobject tracking 
algorithm. Although we test the novel approach on several mass flows 
and mixing ratios in this paper, the multilayer perceptron was trained 
on only one specification, a mass flow of 20 g/s with a ratio of brick 
of 25 %, where we used the tracks of both brick and sand-lime brick 
for training. Images were captured at a frame rate of 100 Hz. The belt 
velocity was approximately 1 m/s. 

The ground truth for the particle’s arrival time and location was gen- 
erated using the concept of a virtual separation bar (see [6,8]), since their 
exact values are not accessible due to the lack of a camera capturing 
the scene at the separation bar and the limited temporal resolution of 
most cameras. For this reason, only the images of the area-scan camera 
are used for training. Therefore, the prediction is performed with re- 
spect to a specific pixel row in the camera image corresponding to the 
virtual separation bar and the tracking phase is shortened accordingly. 
In addition, the coordinate system for the measurements is shifted so 
that the virtual separation bar coincides with the real one. The ground 
truth is then obtained by linear interpolation between the last measure- 
ment before and the first measurement after the virtual separation bar. 
For deployment, the trained network is applied to the original configu- 
ration and fed with non-shifted measurements. Although this concept 
introduces some inaccuracies due to interpolation errors and the as- 
sumption of similar particle motion on the belt and in the flight phase, 
it offers the benefit of not requiring additional sensors and allowing the 
network to be trained in an unsupervised fashion without additional 
costs for manually labeling the ground truth. 
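
The interpolation step can be sketched as follows. This is a minimal illustration, not the implementation from [6,8]: the function name, the tuple layout (time, x, y) and the assumption that particles move towards increasing pixel rows y are ours.

```python
def interpolate_crossing(before, after, y_bar):
    """Linearly interpolate arrival time and location at the virtual bar.

    before, after: (t, x, y) of the last measurement before and the first
    measurement after the pixel row y_bar (hypothetical layout).
    """
    t0, x0, y0 = before
    t1, x1, y1 = after
    if not (y0 < y_bar <= y1):
        raise ValueError("measurements do not enclose the virtual separation bar")
    frac = (y_bar - y0) / (y1 - y0)   # fraction of the frame interval travelled
    return t0 + frac * (t1 - t0), x0 + frac * (x1 - x0)

# Two frames 10 ms apart (100 Hz), bar located halfway between them.
t_arr, x_arr = interpolate_crossing((0.0, 40.0, 90.0), (0.01, 42.0, 110.0), 100.0)
```

Because the label is derived from the track itself, no additional sensor or manual annotation enters this step.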


3.2 Experimental Results 


The true negative rate (TNR) and true positive rate (TPR) were determined 
as performance indicators for the sorting quality. The TNR refers to the 
proportion of residue material that has been successfully removed, and
the TPR to the proportion of product material that has successfully been 
accepted, i. e., not been removed. A selection of the results obtained is 
shown in Figure 4. The individual markers represent the result of an 
individual experiment. 
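
For clarity, both indicators can be written as simple ratios. The sketch below assumes that removed and accepted quantities are available per experiment (as counts or masses); the numbers are made up for illustration and are not results from the paper.

```python
def sorting_rates(residue_removed, residue_total, product_accepted, product_total):
    """Return (TNR, TPR) as defined in the text."""
    tnr = residue_removed / residue_total     # share of residue removed
    tpr = product_accepted / product_total    # share of product kept
    return tnr, tpr

# Hypothetical experiment: 50 g residue (45 g removed), 150 g product (140 g kept).
tnr, tpr = sorting_rates(45.0, 50.0, 140.0, 150.0)
```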

As can be seen from Figure 4, the preliminary results show that 
the novel system achieves results comparable to a highly optimized 
Kalman filter-based one, although it does not outperform it. However, 
considering the early stage of development and the opportunities for 
increasing performance, e. g., by means of training data, we consider it 
a promising future research direction. A benefit already gained lies in
avoiding tiresome manual tuning of the parameters of the motion model,
as the data-driven approach learns its parameters from the provided
examples.


4 Conclusion 


In this paper, we presented the experimental validation of a novel neu- 
ral network-based multiobject tracking system. For this paper, we im- 
plemented and integrated the system for use with a laboratory-scale 
sorting system that was equipped with an area-scan camera. We compared
its performance to that achieved using a line-scan-based system as well
as a multiobject tracking system with physically motivated motion
models. Preliminary results show that the novel system achieves
results comparable to a highly optimized Kalman filter-based one,
although it does not outperform it yet. However, an advantage of the
novel system lies in avoiding tiresome manual tuning of parameters of
the motion model.

Considering the early stage of development of the system, we be- 
lieve there exist various interesting research directions to boost its per- 
formance. Great potential is believed to lie in the expansion and sys- 
tematic selection of training data. Furthermore, a system combining 
physically-motivated as well as machine learning-based models as de- 
scribed in [6] is of great interest. 

[Figure 4 plots. Panels: (a) mass flow 10 g/s, (b) mass flow 20 g/s.
Horizontal axis: ratio reject in % (10, 25, 50). Legend: Line-scan,
Tracking (CV), Tracking (MLP).]

Figure 4: Results of the sorting experiments using the three different prediction models
in terms of TNR and TPR. The individual markers represent the result of an
individual experiment.


Abstract In this contribution we show our approach for a feature-rich
and high-speed BLOB analysis on FPGAs. For the Hybrid-BLOB concept we
use a combination of a single-pass BLOB analysis and a double-pass
labeling algorithm. We use Basler’s VisualApplets for the
implementation of the concept on their microEnable 5 frame grabbers.
We achieve the extraction of the gray value data of the BLOBs at frame
rates that are a factor of 14 higher than with naive labeling of the
complete image. This is achieved by limiting the maximum BLOB size to
128 x 128 px, which speeds up the double-pass labeling algorithm. Our
concept is targeted at applications demanding low latency and high
throughput where BLOBs are small, like sensor-based sorting or
surface inspection.


Keywords Image signal processing, FPGA, BLOB analysis 


1 Introduction 


In image processing the term Binary Large Object (BLOB) analysis
refers to the extraction of connected components of a binary image
with subsequent calculation of the components’ features like area,
circumference, etc. The features are often used to classify these
components. The components are often called objects, as in many
applications single objects are segmented and analyzed. In inspection
tasks these algorithms may be used for the classification of single
objects, e.g. into “accept” or “reject” classes, or to divide the
defects even further, for example into “dent”, “scratch”, etc.

In image processing Field Programmable Gate Arrays (FPGAs) are 
used if high throughput, low latency or energy efficiency is demanded. 
For example FPGAs are used directly in cameras for post processing of 
the sensor data. They are also used in special applications like sensor 
based sorting or surface inspection. 

MSTVision developed an FPGA-based platform for sensor-based sorting,
which aims at minimum latencies [1]. Its logic is completely
implemented with VisualApplets (VA). VA is a proprietary development
platform by Basler (formerly Silicon Software) for FPGA image
processing logic development, tailored for their frame grabbers and
devices with embedded VA support [2]. The platform proved its low
latencies of around 200 µs in [3]. Currently the system’s image
processing capabilities are limited by the feature limits of the VA
BLOB analysis operator.

To run the mentioned tasks on FPGAs, implementations of the BLOB 
analysis are required. The research field in FPGA based BLOB analysis 
algorithms is still active. To extract BLOB features, the connected
components first need to be extracted. This process is called labeling;
its output is an intermediate image with a unique pixel value for each
connected component in the image. There are many algorithms, which
may be divided into four categories [4, p. 352-359]:


1. Single-pass algorithms, where the data only needs to pass the 
computing pipeline once. 


2. Double-pass algorithms, where the data needs to pass the com- 
puting pipeline twice. 


3. Multi-pass algorithms, where the data needs to pass the comput- 
ing pipeline multiple times, depending on the image content. 


4. Random-access algorithms, where the data needs to be accessed 
randomly. 


Each algorithm category has its own pros and cons. Most of the
current research focuses on single-pass algorithms, as they provide the
lowest possible latencies and demand only small quantities of FPGA
resources. The BLOB labeling is done only implicitly. The downside is
the limited number of extractable features, which we explain in the
following subsection. We focus on single- and double-pass algorithms,
as they are used in this contribution.


1.1 Labeling problem in detail 


The main problem for image stream labeling algorithms is “U”-shaped
components; for an example see fig. 1. While processing the binary
image stream, the first object pixel is observed at (1,4). A new
label is created for a unique representation of the object. In the next
image line at (2,1) another object pixel is observed, but based on the
data processed so far, it is not connected with the ones in the line
before. A new label is created. While scanning the line, both labels
coexist. In line 3 both labels turn out to be connected at (3,3) or
(3,4), depending on whether the 4-connected or 8-connected
neighborhood is used. This results in a problem: the previously
assigned labels need to be merged into one. The way the algorithms
overcome that problem is their fundamental difference.

Double-pass algorithms like [5, p. 4] use equivalence tables to record 
these conflicts. One object may consist of many intermediate labels. 
After the first labeling pass, a conflict resolving algorithm is used to 
convert the labels to a unique final label lookup table (LUT). With the 
LUT and the result image of the first pass, the final label image is cre- 
ated. The advantage over single-pass algorithms is the ability to extract 
the component pixel accurately. This enables the calculation of all ob- 
ject features after labeling. The disadvantages are their higher memory 
demands for buffering the intermediate label image and the equiva- 
lence table. Resolving the label conflicts and calculation of the features 
after the labeling adds computing time. For FPGA implementations the 
often required random memory accessibility for the equivalence table 
is a limitation, too. 
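
To make the two passes concrete, here is a plain software sketch of double-pass labeling with a 4-connected neighborhood. It uses a simple union-find list as the equivalence table and is not the FPGA algorithm of [5] or [8]; the small “U”-shaped test image is our own and not the one from fig. 1.

```python
def label_two_pass(img):
    """4-connected two-pass labeling of a binary image (list of rows of 0/1)."""
    rows, cols = len(img), len(img[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]                               # equivalence table, 0 = background

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    # First pass: assign provisional labels and record label conflicts.
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if not up and not left:
                parent.append(len(parent))     # create a new label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(l for l in (up, left) if l)
                if up and left and up != left:
                    ra, rb = find(up), find(left)
                    parent[max(ra, rb)] = min(ra, rb)

    # Conflict resolution: build the final label lookup table (LUT).
    lut, next_id = {0: 0}, 1
    for l in range(1, len(parent)):
        root = find(l)
        if root not in lut:
            lut[root] = next_id
            next_id += 1
        lut[l] = lut[root]

    # Second pass: apply the LUT to the intermediate label image.
    return [[lut[v] for v in row] for row in labels]

u_shape = [[1, 0, 0, 1],
           [1, 0, 0, 1],
           [1, 1, 1, 1]]
labeled = label_two_pass(u_shape)
```

The intermediate label image and the equivalence table both have to be kept until the second pass, which is where the memory demand mentioned above comes from.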

Single-pass algorithms like [6] do not provide a label image output;
instead they calculate the object features directly. The labeling is
only carried out internally. A single-pass algorithm performing the
extraction of the object area would work on the example in fig. 1 as
follows: the first object pixel is observed at (1,4), and a new
temporary label and accumulator are created. For each connected pixel
the area is incremented by 1. In the next image line at (2,1) another
object pixel is observed, and another temporary label and accumulator
are created. When both labels collide, one label is deleted and its
area accumulator is added to the other accumulator. The output of the
algorithm is a list of component features, in this example only the
area. There is no ability to extract the object pixels to compute
other features. The features which may be extracted are limited to
those which can be merged from the values of the sub-component
features on label collision. The advantages of these algorithms are
the small memory requirements, which are limited to the feature and
label table, and the shorter computing time.
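
The area example can be sketched in software as follows. This is a simplified illustration, not the algorithm of [6] or the VA operator: it only keeps the previous image line and one accumulator per temporary label, uses the 4-connected neighborhood, and does not recycle labels or emit features as soon as an object ends.

```python
def blob_areas_single_pass(img):
    """Return the sorted areas of all 4-connected components of a binary image."""
    cols = len(img[0])
    parent, area = [0], [0]                   # index 0 is the background

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    prev = [0] * cols                         # labels of the previous line only
    for row in img:
        cur = [0] * cols
        for c, px in enumerate(row):
            if not px:
                continue
            up = find(prev[c]) if prev[c] else 0
            left = find(cur[c - 1]) if c > 0 and cur[c - 1] else 0
            if not up and not left:           # new temporary label and accumulator
                parent.append(len(parent))
                area.append(0)
                lab = len(parent) - 1
            elif up and left and up != left:  # collision: merge the accumulators
                lab, dead = min(up, left), max(up, left)
                parent[dead] = lab
                area[lab] += area[dead]
                area[dead] = 0
            else:
                lab = up or left
            area[lab] += 1
            cur[c] = lab
        prev = cur
    return sorted(a for l, a in enumerate(area) if l and parent[l] == l)

u_shape = [[1, 0, 0, 1],
           [1, 0, 0, 1],
           [1, 1, 1, 1]]
areas = blob_areas_single_pass(u_shape)
```

The area merges by simple addition; a feature such as the median gray value has no such merge rule, which is why it is out of reach for this class of algorithms.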

With single-pass algorithms, features like the oriented bounding box
or minimum/maximum Feret diameters cannot be calculated. These
features are usually calculated with the object’s convex hull and the
rotating calipers algorithm [7].


Figure 1: A simple “U” shaped object to demonstrate the main challenge of streaming 
labeling algorithms. Modified version from [8, Fig. 3] 


2 Method 


To fill the gap between single- and double-pass algorithms, we devel- 
oped the Hybrid-BLOB concept. The method consists of two BLOB 
analysis/labeling algorithms, a single-pass algorithm and a double- 
pass algorithm. The single-pass algorithm is the one used in the VA 
BLOB analysis operators [9]. The double-pass algorithm is our imple- 
mentation of the algorithm described in [8]. 

The double-pass algorithm is expensive with respect to computing
time and memory if applied to a big image. For big images, the
conflict resolve table does not fit into the FPGA’s on-chip memory of
current Basler frame grabbers, which requires the use of the off-chip
DRAM. Resolving the conflicts is an algorithm of quadratic order. The
single-pass algorithm, in comparison, provides only a few features.

To overcome the limitations, both algorithms are combined, as de- 
scribed in the next subsection. This allows smaller input image sizes 
for the double-pass algorithm, thus the conflict resolve table fits into 
the FPGA’s on-chip memory and the conflict resolve algorithm may 
run faster. 


2.1 Architecture overview 


In fig. 2 the concept is shown. The image input is preprocessed and 
segmented. The single-pass BLOB analysis of VA is applied to the 
segmented image. In parallel, the segmented image and the prepro- 
cessed gray image are stored into dynamic random access memory 
(DRAM). A pre-classification is applied to the output of the single-pass 
BLOB analysis. The remaining objects of interest are retrieved from 
the DRAM buffer via their bounding box information. The extracted 
object images may contain pixels of other objects, as shown in the ex- 
ample BLOBs in fig. 2. The double-pass algorithm is then used to label 
the small images. With the bounding box information of the previ- 
ous BLOB analysis and the label image, the corresponding object may 
be extracted from the binary and gray image. Afterwards we extract 
various object features which then may be used for object classification. 


2.2 Implementation 


The implementation is done in VA using only VA operators, except for
one custom VHDL operator. The target hardware platform is the
microEnable 5 marathon frame grabber family [11]. As the
implementation of most of the individual architecture elements is
straightforward, we focus on the double-pass labeling algorithm and
the feature extraction. For comparison we use an implementation of the
labeling algorithm that labels a whole 1024 x 1024 px image. The
maximum configurable bounding box size for labeling in Hybrid-BLOB is
limited to 128 x 128 px. The labeling stage uses fixed frame size
inputs of the configured maximum bounding box size. The design
transfers the input image and an image with the BLOB features over
Direct Memory Access (DMA) channels.

Figure 2: Hybrid-BLOB architecture overview. Modified version from [10, Fig. 4.1]

Table 1: Comparison of memory requirements of the implementations. The maximum
code length was determined empirically. id is the label id, eq is the equivalent
label id in case of a conflict, r is the run element’s row, s and e are start and end
column of the run. BRAM stands for Block Random Access Memory, the FPGA’s
on-chip memory. [10, Tab. 4.1]

Parameter                             Labeling     Reduced Labeling
id, eq                                16 bit       8 bit
r, s, e                               13 bit       7 bit
Memory per element                    10 byte      5 byte
Max. label count                      65535        255
Max. code length                      65535        4095
Maximum memory                        5.24 Mbit    163.8 kbit
BRAM elements (18 kbit per element)   291.3        9.1

Labeling The labeling algorithm is an implementation of [8]. It is
based on run length encoding (RLE) and uses the 4-connected
neighbourhood. Depending on the design, different bit depths are used
for the labels and the run length code. Labeling smaller images
requires fewer coordinate bits and fewer possible labels. Tab. 1 shows
the resource occupation of both variants. Reducing the size of the
image to be labeled lowers the required memory, which in practice
enables the storage of the run length data in the FPGA’s on-chip
memory. For the labeling of the whole image, the data is stored in the
frame grabber’s DRAM. The whole-image labeling design does not contain
the calculation of features.
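
As an illustration of the run elements (r, s, e) listed in tab. 1, the following sketch run-length encodes a binary image in software. It only shows the data representation and is not the VA implementation; the inclusive end column is an assumption.

```python
def encode_runs(img):
    """Return runs of object pixels as (row, start column, end column inclusive)."""
    runs = []
    for r, row in enumerate(img):
        start = None
        for c, px in enumerate(row):
            if px and start is None:
                start = c                      # a run begins
            elif not px and start is not None:
                runs.append((r, start, c - 1)) # a run ends before this pixel
                start = None
        if start is not None:                  # run reaches the end of the line
            runs.append((r, start, len(row) - 1))
    return runs

runs = encode_runs([[0, 1, 1, 0, 1],
                    [1, 1, 1, 1, 1]])
```

For a 128 x 128 px input, each of r, s and e fits into 7 bit, which matches the reduced variant in tab. 1.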


Feature extraction With the extracted object’s image data, the
feature extraction takes place. The extraction is completely integrated
into the FPGA. The oriented bounding box and Feret features are
not calculated with the convex hull and rotating calipers. Instead,
they are approximated by discrete object rotations in angle steps of
0.703°. To save FPGA resources, the calculation of unneeded features
may be removed. The extracted features are:


• VA-Operator: bounding box, area, center of gravity (output of
  single-pass analysis stage).


• Gray Value: min, max, mean, std, median, upper and lower quartile,
  difference to a reference histogram (rel/abs).


• Other binary image features: Euler’s number, circumference,
  compactness, circularity, circle equivalent diameter.


• Binary image moments: 2nd and 3rd order.


• Ellipse features: main axis angle, main and minor axis radius,
  eccentricity.


• Gray image moments: 2nd and 3rd order.


• Oriented bounding box: area, angle, width, height.


• Feret diameter: minimum, maximum, min. angle, max. angle.

For further information about the features, we suggest [12], [13], [14], 
[15] and [16]. 
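
The approximation of the Feret diameters by discrete rotations can be illustrated in software. The sketch below is not the FPGA implementation: it projects the pixel centers onto 256 directions between 0° and 180° (a step of 180°/256 ≈ 0.703°, which matches the step size stated above; the number of steps is our inference) and measures distances between pixel centers.

```python
import numpy as np

def feret_by_rotation(mask, steps=256):
    """Approximate min/max Feret diameter and their angles (in degrees)."""
    ys, xs = np.nonzero(np.asarray(mask))
    angles = np.deg2rad(np.arange(steps) * 180.0 / steps)
    # Projection of every object pixel onto every direction.
    proj = np.outer(np.cos(angles), xs) + np.outer(np.sin(angles), ys)
    widths = proj.max(axis=1) - proj.min(axis=1)
    i_min, i_max = widths.argmin(), widths.argmax()
    return (widths[i_min], np.rad2deg(angles[i_min]),
            widths[i_max], np.rad2deg(angles[i_max]))

# A horizontal bar of five pixels: longest along 0 degrees, thinnest along 90 degrees.
f_min, a_min, f_max, a_max = feret_by_rotation([[1, 1, 1, 1, 1]])
```

The same projections also yield an approximate oriented bounding box, by pairing the width at each angle with the width at the perpendicular angle and choosing the pair with the smallest product.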


3 Results 


Both test designs have been built with VisualApplets 3.3.1 for the
microEnable 5 Marathon VCLx frame grabber, running at a frequency
of 125 MHz [17] [18]. The frame grabber runtime used is version 5.7.
Fig. 3 shows our test image. It contains 1161 BLOBs and has a
resolution of 1024 x 1024 px. The number of objects is not
representative of real applications. The image is uploaded to the FPGA
DRAM and repeatedly processed for our tests. We use the frame rate
shown by microDisplay, the runtime application used to configure the
frame grabber. The frame rates are validated against debug registers
on the FPGA. The BLOB frequency is measured with a debug register. For
our measurements, we do not filter the output of the single-pass stage.
The BLOB count varied from 1143 to 1161 while testing. The reason
for these variations is currently unknown; they originate in the
single-pass stage. We use 1161 as BLOB count for calculating the mean
time of the labeling design and the period of the BLOB frequency for
the Hybrid-BLOB design. Tab. 2 shows our results. Our Hybrid-BLOB
concept runs at 14 times higher frame rates, even though the data
throughput in the double-pass stage is 250 times higher. Due to the
fixed frame size for bounding box extraction, the labeling overhead
increases if many small objects are present.

Table 2: Measurement results for processing the image shown in fig. 3.

Parameter              Full Labeling   Hybrid-BLOB
Frame rate             1.25 Hz         17.0 ± 0.5 Hz
Mean BLOB frequency    1.5 kHz         19.8 ± 0.010 kHz
Mean time per BLOB     689 µs          51 µs
Labeling throughput    1.3 Mpx/s       324.4 Mpx/s
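
The derived quantities in tab. 2 follow from the measured frame rate and BLOB frequency. The short calculation below is our own cross-check, assuming that the throughput is the number of labeled pixels per second (1024 x 1024 px per frame for full labeling, 128 x 128 px per BLOB for Hybrid-BLOB).

```python
blob_count = 1161

# Full labeling: one 1024 x 1024 px frame per labeling run.
frame_rate_full = 1.25                               # Hz
blob_freq_full = frame_rate_full * blob_count        # BLOBs per second
time_per_blob_full = 1.0 / blob_freq_full            # seconds
throughput_full = frame_rate_full * 1024 * 1024      # px/s

# Hybrid-BLOB: one fixed 128 x 128 px frame per BLOB.
frame_rate_hybrid = 17.0                             # Hz
blob_freq_hybrid = 19.8e3                            # BLOBs per second
time_per_blob_hybrid = 1.0 / blob_freq_hybrid        # seconds
throughput_hybrid = blob_freq_hybrid * 128 * 128     # px/s

frame_rate_gain = frame_rate_hybrid / frame_rate_full
throughput_gain = throughput_hybrid / throughput_full
```

The frame rate gain evaluates to 13.6 and the throughput gain to about 247, consistent with the rounded factors 14 and 250 quoted in the text.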

Figure 3: Test image used. It contains 1161 objects at a resolution of 1024 x 1024 px.


4 Conclusion


We have shown an approach to speed up a feature-rich BLOB analysis
on FPGAs. The implementation with VisualApplets enables its use
on Basler frame grabbers of the current portfolio and on possible
future platforms supporting VisualApplets. In our test scenario,
Hybrid-BLOB processes 19,800 BLOBs per second, which allows its use in
the field of granule sorting. To increase the throughput further, the
labeling and feature extraction stage may be implemented multiple
times in parallel. Our concept may be used in traditional PC-based
image processing, too.

The throughput and latency may be further improved if the double-pass
labeling algorithm is extended to support variable image input
sizes for overhead reduction. If variable input sizes are used, the run
length encoding stage runs faster and the number of runs to label
decreases. We expect big improvements if small BLOBs are processed, as
the measured overhead is a factor of 250 compared to the image input.

Abstract The size of the images and the amount of data we process
every day have been growing exponentially over the last years.
Quantum computers promise to process this data more efficiently.
Experiments on quantum computer simulators prove the paradigms
this promise is built on to be correct. However, currently, running
the very same algorithms on a real quantum computer is
often too error prone to be of any practical use. We explore the 
current possibilities for image processing on real quantum com- 
puters. We redesign a commonly used quantum image encoding 
technique to reduce its susceptibility to errors. We show experi- 
mentally that the current size limit for images to be encoded on 
the quantum computer and subsequently retrieved with an error 
of at most 5% is 2 x 2 pixels. A way to circumvent this limitation 
is to combine ideas of classical filtering with a quantum algo- 
rithm operating locally, only. We show the practicability of this 
strategy using the application example of edge detection. Our 
hybrid filtering scheme’s quantum part is an artificial neuron, 
working well on real quantum computers, too. 


Keywords Quantum image processing, quantum image encod- 
ing, quantum edge detection, quantum artificial neurons, IBM 
Quantum Experience 


1 Introduction


In this contribution, we do not discuss quantum imaging methods. 
Throughout, we assume the image data to be processed on a quantum 
computer to be given as a classical gray-value image. Thus, first, we 
have to encode the gray-value information into quantum states. There 
are basically three concepts for this encoding, namely basis encoding, 
phase encoding, and amplitude encoding. Within the last years, several 
methods have been developed following these three basic concepts [1]. 
Here, we concentrate on the phase encoding method Flexible
Representation of Quantum Images (FRQI) [2] and improve its implementation.

After the encoding, we normally process the states by applying some 
algorithms. Initially, algorithms were only formulated in theory or 
executed on simulators of quantum computers. Only since 2016, it 
has also been possible to execute algorithms on real quantum comput- 
ers. A short overview of currently available algorithms is given in [3]. 
Here, we aim at algorithms that run on the actual quantum hardware. 
More precisely, we implement quantum image processing algorithms 
on IBM’s superconducting quantum computers [4]. 

This paper is organized as follows. Section 2 provides some ba- 
sics of quantum computing. In Section 3, we describe the experimental 
setup including the quantum computers, the software, and the classical 
computers used. We explain our improved version of the FRQI image
encoding in Section 4. In Section 5, we present the idea of hybrid quan- 
tum image filtering and highlight the performance for detecting edges 
in images with a quantum computer. Two variants of the quantum 
edge detector with 2D and 1D masks are detailed. Section 6 concludes 
the paper. 


2 Quantum computing basics 


Before diving into quantum image processing, we summarize some 
basic concepts of quantum computing [5]. Classical computing and 
quantum computing follow completely different paradigms, starting 
with the basic elements. Classically, everything builds on bits that
can attain either state 0 or 1. The quantum analogues are the quantum
bits (qubits), two-state quantum systems that allow for more
flexibility. Analogous to 0 and 1, there are two basis states of a
qubit: $|0\rangle = (1,0)^T$ and $|1\rangle = (0,1)^T$. However, any
linear combination (superposition)


$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \qquad (1)$$


of the basis states with $\alpha, \beta \in \mathbb{C}$ and
$|\alpha|^2 + |\beta|^2 = 1$ defines a possible state, too. The overall
phase of a quantum state is unobservable [5]. That is, $|\psi\rangle$
and $e^{i\xi}|\psi\rangle$ for $\xi \in [0, 2\pi]$ define the same
state. Hence, it is sufficient to consider $\alpha \in \mathbb{R}$.

As a consequence, the state of a single qubit can be visualized as a
point on the unit sphere in $\mathbb{R}^3$ (Bloch sphere) with
spherical coordinates $\phi$ and $\theta$, where
$\alpha = \cos(\theta/2)$ and $\beta = e^{i\phi}\sin(\theta/2)$. All
operations on a qubit must preserve the condition
$|\alpha|^2 + |\beta|^2 = 1$, and can thus be represented by 2 x 2
unitary matrices. Standard operations (so-called gates) acting on
single qubits are


$$X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad
P(\theta) = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\theta} \end{pmatrix}, \qquad (2)$$


where the X-gate acts like a classical NOT operator and the Hadamard
gate (H) superposes the basis states of a single qubit. A qubit in
superposition can be thought of as having all possible states at the
same time. The Phase gate (P) rotates by $\theta$ about the z-axis of
the Bloch sphere. Phase shift gates can be used to encode gray-values.
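
The gate definitions can be checked numerically. The following NumPy sketch is an illustration only (it does not use a quantum computing framework); it applies the gates to the basis states as column vectors and verifies that they are unitary.

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

def P(theta):
    """Phase gate: leaves |0> unchanged, multiplies |1> by exp(i*theta)."""
    return np.array([[1, 0], [0, np.exp(1j * theta)]], dtype=complex)

def is_unitary(U):
    return np.allclose(U.conj().T @ U, np.eye(U.shape[0]))

flipped = X @ ket0                 # equals |1>
plus = H @ ket0                    # equal superposition of |0> and |1>
probabilities = np.abs(plus) ** 2  # measurement probabilities of both outcomes
```

Applying P to the superposition changes only the relative phase between the two amplitudes, so the measurement probabilities stay at 0.5 each.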

Additionally, we need operations that link two or more qubits. The
most common operation in quantum computing is the controlled NOT-gate
(CX-gate) taking two input qubits. The target qubit’s state is
changed depending on the state of the control qubit:


$$CX = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}. \qquad (3)$$


That means, if the control qubit is in state $|1\rangle$, then we apply
an X-gate to the target qubit. Otherwise, we do nothing. For example,
assume our two-qubit system has the state
$|10\rangle = |1\rangle \otimes |0\rangle$, where the first qubit is
the control qubit, the second the target qubit, and $\otimes$ is the
tensor product. Then, application of the CX-gate results in the state


$$|11\rangle = |1\rangle \otimes |1\rangle = (0,0,0,1)^T. \qquad (4)$$


So basically, the application of quantum gates can be formulated in
terms of linear algebra.
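
The CX example translates directly into matrix-vector arithmetic. The NumPy sketch below is for illustration; it uses np.kron for the tensor product, with the first factor being the control qubit as in the text.

```python
import numpy as np

ket0 = np.array([1, 0])
ket1 = np.array([0, 1])

CX = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 0, 1],
               [0, 0, 1, 0]])

ket10 = np.kron(ket1, ket0)        # control |1>, target |0>  -> (0, 0, 1, 0)
ket11 = CX @ ket10                 # target flipped           -> (0, 0, 0, 1)

ket00 = np.kron(ket0, ket0)        # control |0>: nothing happens
unchanged = CX @ ket00
```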

In general, we can apply any unitary operation to the target qubit. 
For example, a controlled-Phase gate applies a P-gate to the target qubit 
if and only if the control qubit is in state |1). We can also increase the 
number of control qubits even further. The operation with two control 
qubits and an X-gate applied to the target qubit is called Toffoli gate. 

Applying such controlled operations to two or more qubits with 
the control qubits in superposition, results in the entanglement of the 
qubits involved. In terms of linear algebra, an entangled state of sev- 
eral qubits is one that cannot be written as a tensor product of states 
of the individual qubits. Entanglement is exactly where we benefit
from the quantum computing properties. Together with superposition,
entanglement allows the use of a logarithmically lower number of qubits
compared to the number of classical bits.

While in a classical computer all bits are connected to each other, in 
IBM’s quantum computer the qubits are arranged in a special, so-called 
heavy-hexagonal scheme (see the honeycomb structure in Figure 1). 
That is, each qubit is directly connected to at most three other qubits. 
To apply two qubit gates to unconnected qubits, the information has 
to be swapped to neighbouring qubits by application of additional CX- 
gates. Each CX-gate, however, increases the overall error considerably 
such that an algorithm should employ as few CX-gates as possible. 

Lastly, the readout is also completely different for classical and
quantum computing. On classical computers, you can always read the
current state of the bits, copy them, or just continue running an
algorithm with the same state of the bits as before the readout.
Unfortunately, this is not possible on quantum computers. First,
according to the no-cloning theorem [5], a state cannot be copied.
Second, when measuring (reading out the state of) a qubit, its state
collapses to one of the basis states $|0\rangle$ or $|1\rangle$. Hence,
continuing the algorithm after readout is not possible. Additionally,
measuring a qubit does not immediately provide the values of $\alpha$
and $\beta$ in Equation (1). However, the probability of collapsing to
$|0\rangle$ is given by $|\alpha|^2$, while the state $|1\rangle$ is
obtained with probability $|\beta|^2$. Repeated measurements (shots) of
the same state allow for an estimation of these probabilities and thus
the values $\alpha$ and $\beta$, too. For further reading on quantum
computing basics, we recommend [5].
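
The estimation from repeated measurements can be imitated classically. The sketch below simulates ideal, noise-free shots with a binomial draw; the function name and the chosen state are ours. Note that the counts only determine the magnitudes of the two amplitudes.

```python
import numpy as np

def estimate_probabilities(alpha, beta, shots, seed=0):
    """Simulate `shots` ideal measurements of alpha|0> + beta|1>."""
    if not np.isclose(abs(alpha) ** 2 + abs(beta) ** 2, 1.0):
        raise ValueError("state is not normalized")
    rng = np.random.default_rng(seed)
    ones = rng.binomial(shots, abs(beta) ** 2)   # number of outcomes |1>
    return (shots - ones) / shots, ones / shots

# State with theta = pi/3: |alpha|^2 = 0.75 and |beta|^2 = 0.25.
theta = np.pi / 3
p0_hat, p1_hat = estimate_probabilities(np.cos(theta / 2), np.sin(theta / 2), shots=8192)
alpha_hat = np.sqrt(p0_hat)                      # estimate of |alpha|
```

The statistical error of such an estimate shrinks with the square root of the number of shots; on real hardware, gate and readout errors come on top.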

Figure 1: Coupling map of the backends used in this paper. Every circle represents a
qubit, lines represent connections between the qubits. Colors code the readout
errors (circles) and the CX-errors for the connections (lines). Dark blue
indicates a small error, purple a large one. Errors are shown for ‘ibmq_ehningen’.
‘ibmq_toronto’ has the same coupling map, but errors differ slightly (see
Table 2).


3 Near-term quantum computers 


We use the open-source software development kit Qiskit [6] for working
with IBM’s circuit-based superconducting quantum computers [4].
IBM provides a variety of systems, also known as backends, which
differ in the type of the processor, the number of qubits (scale), and
their connectivity [4]. Access is provided via a cloud. In this paper,
we use two of the available 22 backends, ‘ibmq_toronto’ and
‘ibmq_ehningen’, see Table 1. This choice is not crucial for our use
case, as we use only a small subset of the qubits and the backends’
performance does not differ significantly. The coupling map, i.e., the
connections between the qubits, of the backend ‘ibmq_ehningen’ is shown
in Figure 1. Additional parameters describing the performance of IBM’s
backends are quality (quantum volume) and speed (circuit layer
operations per second, CLOPS). All parameters of the two backends used
are summarized in Table 1.

Besides the coupling map and the performance values listed above,
external conditions influence the backends. Thus, compared to classical
computers, the basic operations of quantum computers yield quite
large errors. For example, applying a few gates or performing
measurements is currently quite noisy, with errors that can change hourly.
Typical average values for the CX error, single-qubit gate error, and readout
error are shown in Table 2. Additionally, Table 2 shows the decoherence
times T1, a decay constant measuring how likely a qubit is to stay
in the state |1⟩ rather than decay to |0⟩, and T2, the dephasing time
measuring how long the phase of a qubit stays intact. The circuit depth
counts the maximal number of basis operations performed by a single
qubit during an algorithm. A high circuit depth results in an
accumulation of errors during the runtime of the algorithm.

Table 1: Processor type and actual performance of the used backends as measured in
September 2022.

Backend          Processor type  Scale [# qubits]  Quality [QV]  Speed [CLOPS]
‘ibmq_toronto’   Falcon r4       27                32            2800
‘ibmq_ehningen’  Falcon r5       27                64            1900

Table 2: Typical average calibration data of the two chosen backends. The values are
from September 2022.

Backend          CX error [%]  Single-qubit gate error [%]  Readout error [%]  T1 [µs]  T2 [µs]
‘ibmq_toronto’   5.34          0.051                        3.66               103.71   107.72
‘ibmq_ehningen’  0.71          0.024                        1.05               151.74   160.92
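To illustrate how errors accumulate with the number of operations, the following minimal sketch estimates the probability that a circuit runs without any gate or readout error, using the average error rates from Table 2. It assumes independent errors and ignores decoherence, which is a strong simplification and not an error model taken from this paper. The function name and the example gate counts are hypothetical.

```python
def success_probability(n_cx, n_single, n_readout,
                        cx_error, single_error, readout_error):
    """Probability that no operation fails, assuming independent errors.

    Error rates are given as fractions (e.g. 0.0071 for 0.71 %).
    """
    return ((1.0 - cx_error) ** n_cx
            * (1.0 - single_error) ** n_single
            * (1.0 - readout_error) ** n_readout)


# Average calibration values from Table 2, converted from percent.
BACKENDS = {
    "ibmq_toronto": dict(cx_error=0.0534, single_error=0.00051,
                         readout_error=0.0366),
    "ibmq_ehningen": dict(cx_error=0.0071, single_error=0.00024,
                          readout_error=0.0105),
}

if __name__ == "__main__":
    # Hypothetical circuit: 50 CX gates, 100 single-qubit gates,
    # 3 measured qubits.
    for name, errors in BACKENDS.items():
        p = success_probability(50, 100, 3, **errors)
        print(f"{name}: {p:.3f}")
```

Under these assumptions the CX error dominates, which is consistent with the emphasis on reducing the number of CX gates.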

An additional issue in quantum computing is that only a few opera- 
tions, called basis gates, can be performed on the quantum computer. 
Currently, IBM’s superconducting quantum computers have five basis 
gates: the identity, X-, CX-, and P-gates, and the square-root-of-X (SX)
gate, which rotates by π/2 about the x-axis of the Bloch sphere [4]. Qiskit
includes a transpiler, which decomposes a given algorithm into these
basis gates and optimizes the resulting circuit [6]. Nevertheless,
keeping the available basis gates in mind when developing algorithms 
helps to limit their overall number. 

For preparing data and generating and storing the circuits before 
sending them to the quantum computer, we use a classical computer 
with an Intel Xeon E5-2670 processor running at 2.60 GHz, a total RAM 
of 64 GB, and Red Hat Enterprise Linux 7.9. 

Figure 2: Circuit depth for varying image sizes and MCRY-/MARY-implementation on
backend ‘ibmq_toronto’. Mean values of 10 observations in logarithmic scale.


4 Quantum image encoding 


There are many methods for encoding images in quantum computers.
One of the most frequently mentioned methods is FRQI, introduced
in [2]. Assume that we want to encode a 2^n × 2^n pixel gray-value
image. We split the required qubits into two parts: 2n qubits for the pixel
positions and one qubit for the gray-value information. In practice,
FRQI can be implemented on superconducting quantum computers by
using entanglement between the position qubits and the gray-value
qubit. We take a closer look at the heart of the FRQI algorithm, the
multi-controlled y-rotation gate (MCRY). It applies a rotation around
the y-axis corresponding to the gray value only if all position (i.e.
control) qubits are in state |1⟩. Subsequently, X-gates change the position
state to which the next rotation is applied. Thus, in total we need
one MCRY gate for each gray value in the classical image. As
discussed above, on a real backend, complex operations like MCRY have
to be constructed by concatenating available basis gates.
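The encoding can be illustrated classically. The sketch below computes the ideal FRQI amplitudes for a small image and recovers the gray values from the resulting measurement probabilities. It assumes the common FRQI convention that a gray value g in [0, 1] is mapped to the rotation angle θ = g·π/2; this mapping and the function names are assumptions for illustration, and no noise is modelled.

```python
import math


def frqi_amplitudes(gray_values):
    """Ideal FRQI amplitudes for gray values in [0, 1].

    Returns one pair (a0, a1) per pixel: the amplitudes of the
    gray-value qubit being |0> or |1> for that pixel position.
    """
    norm = 1.0 / math.sqrt(len(gray_values))  # equals 1 / 2^n for 4^n pixels
    amplitudes = []
    for g in gray_values:
        theta = g * math.pi / 2.0  # assumed angle encoding
        amplitudes.append((norm * math.cos(theta), norm * math.sin(theta)))
    return amplitudes


def frqi_decode(amplitudes):
    """Recover gray values from ideal measurement probabilities."""
    decoded = []
    for a0, a1 in amplitudes:
        p0, p1 = a0 * a0, a1 * a1
        theta = math.acos(math.sqrt(p0 / (p0 + p1)))
        decoded.append(theta / (math.pi / 2.0))
    return decoded


if __name__ == "__main__":
    image = [0.0, 0.25, 0.5, 1.0]  # flattened 2 x 2 image
    print(frqi_decode(frqi_amplitudes(image)))
```

On a real backend the probabilities p0 and p1 are estimated from a finite number of shots, so the decoded gray values are affected by both sampling and hardware noise.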

Inspired by [7], we replace MCRY by what we call multi-adapted-
controlled y-rotation gates (MARY). Our MARY gates need fewer basis
gates, especially fewer of the particularly error-prone CX-gates. Thus,
the replacement reduces the overall error significantly. Moreover, fewer 
gates and lower circuit depth (Figure 2) speed up calculations. The im- 
pact of replacing MCRY by MARY increases with image size. In MCRY, 
all qubits would ideally have to be connected with each other. Hence, 
missing connections on the real backends have to be circumvented by 
swapping with CX-gates. In contrast, MARY requires a much smaller 
connectivity between the qubits. 

Figure 3: Results for 2 × 2 gray-value images using the mean of the executions. In the last
column, measurement error mitigation techniques have been applied [8].


Figure 3 shows the performance on a 2 x 2 sample image. The hard- 
ware induced error is clearly visible in the results achieved on the real 
backend. In fact, on the real backend we can only retrieve the image with
acceptable error when applying measurement error mitigation [8]. That is,
the distribution of the error is estimated from observations on exactly this
backend. Inverting the error model then improves the results. To
our knowledge, image retrieval with FRQI for images larger than 2 × 2
is currently not possible on real backends, see also [8-10].
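The idea of inverting an estimated error model can be sketched for a single qubit. The example below is a generic calibration-matrix inversion and not necessarily the technique of [8]; the calibration values and function names are hypothetical.

```python
def invert_2x2(matrix):
    """Invert a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("calibration matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mitigate(measured, calibration):
    """Correct measured probabilities [P(0), P(1)] of a single qubit.

    calibration[i][j] is the estimated probability of reading out i
    when the state j was prepared.
    """
    inv = invert_2x2(calibration)
    corrected = [
        inv[0][0] * measured[0] + inv[0][1] * measured[1],
        inv[1][0] * measured[0] + inv[1][1] * measured[1],
    ]
    # Inversion can yield small negative values; clip and renormalise.
    clipped = [max(0.0, p) for p in corrected]
    total = sum(clipped)
    return [p / total for p in clipped]


if __name__ == "__main__":
    # Hypothetical calibration: 3 % of prepared |0> are read as 1,
    # 5 % of prepared |1> are read as 0.
    calibration = [[0.97, 0.05], [0.03, 0.95]]
    print(mitigate([0.234, 0.766], calibration))
```

For several qubits the calibration matrix grows to 2^k × 2^k, which is one reason why this approach is practical only for small registers.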

Table 3 shows our findings for the maximum executable and usable 
image sizes for the MCRY- and MARY-implementations. Executable
here means that the algorithm can be run at all, regardless of the quality
of the outcome. Usable means that the relative difference between the
input image and the reconstructed image is less than 5%. We clearly see a
benefit of the MARY-implementation in terms of maximum executable 
image size. However, due to the high noise level of the backends, we 
could not increase the maximum usable image size. 

Having experienced this tight restriction, we still aim at image pro-
cessing algorithms that are robust to the hardware noise in the cur-
rent noisy intermediate-scale quantum (NISQ) era and hence executable on the
real backends. In the next section, we describe a design pattern for 
algorithms meeting these demands. 

Table 3: Current maximum executable and usable image sizes for MCRY- and MARY-
implementations on ‘qasm_simulator’ with 8192 shots and IBM’s backend
‘ibmq_toronto’, limited to 64 GB memory.

          maximum executable image size       maximum usable image size
Method    ‘qasm_simulator’  ‘ibmq_toronto’    ‘qasm_simulator’  ‘ibmq_toronto’
MCRY      32 × 32           16 × 16           16 × 16           2 × 2
MARY      256 × 256         32 × 32           16 × 16           2 × 2


Figure 4: Scheme from [12] for edge detection in a 30 × 30 sample image by using 2 ×
2 filter masks, ‘qasm_simulator’ and backend ‘ibmq_ehningen’ (executed on
October 15, 2021) with 8192 shots, and ToolIP [13] for post-processing.


5 Quantum image filtering 


In this section, we introduce a class of hybrid algorithms combining 
classical filtering with quantum computing on 2 x 2 pixel patches. As 
an example, we combine classical edge detection with a quantum artifi- 
cial neuron [11] as sketched in Figure 4. We calculate the inner product 
of the input image patch and the filter mask not only on a simulator but 
also on real quantum computers [12]. Being restricted to 2 x 2 masks, 
we can either apply that directly or split our task into one-dimensional 
filtering steps. The latter is more robust with respect to noise [12]. 
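The quantity estimated by the quantum artificial neuron can be sketched classically as the squared inner product of two encoded states. The example below assumes a phase encoding in which each value v in [0, 1] becomes a phase π·v in an equal superposition; the exact encoding used in [12] may differ, and the function names and sample masks are hypothetical.

```python
import cmath
import math


def phase_state(values):
    """Equal superposition whose k-th amplitude carries the phase pi * values[k]."""
    norm = 1.0 / math.sqrt(len(values))
    return [norm * cmath.exp(1j * math.pi * v) for v in values]


def neuron_activation(patch, mask):
    """Squared inner product |<mask|patch>|^2 of the two encoded states."""
    psi_patch = phase_state(patch)
    psi_mask = phase_state(mask)
    inner = sum(w.conjugate() * a for w, a in zip(psi_mask, psi_patch))
    return abs(inner) ** 2


if __name__ == "__main__":
    # Flattened 2 x 2 patches in row-major order.
    vertical_edge_mask = [0.0, 1.0, 0.0, 1.0]
    flat_patch = [0.3, 0.3, 0.3, 0.3]
    edge_patch = [0.0, 1.0, 0.0, 1.0]
    print(neuron_activation(flat_patch, vertical_edge_mask))
    print(neuron_activation(edge_patch, vertical_edge_mask))
```

On a quantum computer this activation is obtained as a measurement probability and therefore estimated from a finite number of shots; thresholding it per patch yields the edge map.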

Figure 5: Hybrid quantum edge detection. U_I encodes the input image patch and U_w
the filter mask. The gray-value information is encoded in the P-gates. In the 1D
case, the additional diagonal direction is required for detecting corners, too.


Figure 6: Results for the 256 × 256 House image [14] (panels: input image, simulator
‘qasm_simulator’, real backend ‘ibmq_ehningen’). The ‘qasm_simulator’ and backend
‘ibmq_ehningen’ results differ only slightly.


Moreover, only a very small number of gates and only one qubit per 
direction and pixel are required. This ensures that a very small number 
of shots (measurements) suffices for identifying the edges of the image. 
The lower number of shots in turn reduces the execution time signifi- 
cantly. The quantum circuits of the two implementations are shown in 
Figure 5. 

Figure 6 shows the results of our hybrid 2D edge detection for a
typical toy example image [14]. In [12], we process 256 × 256 pixel
gray-value images. Extending to larger images only increases the number
of circuits but does not decrease the robustness of our algorithm.
Nevertheless, we create one circuit for each combination of
input image patch and filter mask, so the number of circuits grows quickly
with larger images. In classical computing, this can be compensated by
parallelization. This is also an option in quantum computing: we
can use several qubits in parallel and process multiple image patches at
the same time. This decreases the number of circuits needed and
hence the execution time. In addition, mid-circuit measurement [4] allows
a qubit to be measured at any step of the algorithm and reused
for further calculations.


6 Conclusion 


Quantum computing is potentially very useful in image processing. It 
promises exponentially lower memory usage in terms of qubits com- 
pared to classical bits and also faster calculations. However, the cur- 
rently available noisy intermediate-scale quantum computers are still 
quite error-prone, and hardware improvement is the subject of very active
research. At the moment, image retrieval is only possible for
images up to a size of 2 x 2. A strategy to deal with these limita- 
tions is to combine quantum and classical algorithms. In such hybrid 
solutions, the quantum computing part is actually much smaller than 
the classical part. We use only a small number of gates, and avoid or 
decrease the number of particularly error-prone types. The quantum 
computing share can be extended gradually along with the hardware 
progress. Instead of trying to implement all image processing func-
tionality on quantum computers, we should rather identify for which
problems and for which steps in complex algorithms quantum computing
can be helpful or eventually even beat classical machines.

Abstract The exact measurement of process-relevant parameters
and product properties is a prerequisite for efficient and sustainable
production. In addition to accuracy, industrial applications
place tough demands on the real-time capability and achievable
measurement rates of the sensor technology. In the past, radar
signal processing was mainly done on highly specialised
hardware to achieve the necessary performance. Computer
systems were used only to perform simulations and to test new
algorithms before these were implemented at high effort. The
resulting sensor systems are rigid, and enhancing them is
time-consuming and costly. With increasingly powerful graphics
processing units (GPUs) and the possibility to use them for
general-purpose computing, a new approach is to move parts of the
radar signal processing from the specialised hardware to
commercially available computer systems. The main objectives of this
idea are to reduce the development time of new sensor systems,
to facilitate their modification, and to increase the re-usability of
the produced code. This approach is tested with a new imaging
radar algorithm, developed for a frequency-modulated
continuous-wave (FMCW) radar system with a modular multiple-input
multiple-output (MIMO) antenna array. The implementation of
this algorithm is used to determine the limits of this new
approach and involves a step-by-step optimisation process to
improve the performance of the final result.


Keywords Imaging radar algorithm, FMCW radar, MIMO sen- 
sor system, back-projection, CUDA C++, GPU programming 

Figure 1: Activity diagram showing the steps of the radar imaging algorithm.


1 Sensor system 


Based on recent research on MIMO imaging radar sensors and prior 
projects to measure the width of steel slabs in rolling mills [1] [2] [3] 
[4], a new sensor system is planned to not only measure the metal 
slab dimensions, but also to reconstruct high-resolution images of the 
material surface and to determine its speed. 

The MIMO signal processing and its GPU implementation are based
on a planned radar system with 197 real channels. The sensor operates
according to the FMCW principle with a 3 dB transmission range of
30 GHz (119 GHz to 149 GHz). The sensor has a simulated resolution of
0.8 mm along the vertical axis at a distance of 500 mm. The combination
of all transmit and receive channels provides a virtual array aperture of
1300 mm distributed over 677 spatial positions with a virtual sampling
distance of 2λ ≈ 4.2 mm.

The transmitters are time-multiplexed, and only one transmitter is 
active at once. While the active transmitter is sending the chirp pulse, 
all other antennas act as receivers for the reflected radar signal.


2 Imaging algorithm 


The input data for the algorithm is any number of pre-determined in-
termediate frequency (IF) signals s_TxRx(n) of length N from any trans-
mitter (Tx) and receiver (Rx) pair. The raw data is transmitted via Ethernet
using a UDP-based data protocol. Given these signals, the imaging algorithm
consists of the steps shown in figure 1.


2.1 Signal pre-processing


The pre-processing involves a Hilbert transform function ℋ to get the
analytic input signal and the application of a complex combined filter
function w_TxRx(n) with zero padding. Taking these two
functions into account, the signal after the pre-processing step is given by (1).


$$ s^{p}_{TxRx}(m) = \begin{cases} w_{TxRx}(m)\,\mathcal{H}[s_{TxRx}](m) & 0 \le m < N \\ 0 & \text{otherwise} \end{cases} \qquad (1) $$

For the actual implementation, the analytic signal is approximated
through a fast Fourier transform (FFT) by setting all negative frequen-
cies in the signal spectrum to zero, followed by an inverse FFT [5]. The
zero padding then appends a specified number of zeros to the filtered
IF signal to get a spectrum with lower peaks but a finer sampling of the
range axis during the image reconstruction. With the speed of light c_0
and the radar bandwidth B, the refined step size Δd of the range axis
after the zero padding is given by (2). The possible length N_p of
the padded signal s^p_TxRx(m) will be determined through tests with the
finished system.


$$ \Delta d = \frac{c_0 N}{2 B N_p} \qquad (2) $$
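The pre-processing steps can be sketched in a few lines. The example below uses a naive discrete Fourier transform as a stand-in for an FFT, doubles the positive frequencies when removing the negative ones (the usual convention for the analytic signal, assumed here), appends zeros, and evaluates the range step (2). All function names are hypothetical; the example values B = 30 GHz, N = 1024 and N_p = 8192 are taken from the sensor description and the caption of figure 2 and give a step of roughly 0.62 mm.

```python
import cmath
import math

C0 = 299_792_458.0  # speed of light in m/s


def dft(signal, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(signal)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for m, value in enumerate(signal):
            acc += value * cmath.exp(sign * 2j * math.pi * k * m / n)
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(real_signal):
    """Approximate the analytic signal by removing negative frequencies."""
    n = len(real_signal)
    spectrum = dft(real_signal)
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue            # DC and Nyquist bins stay unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2.0  # positive frequencies are doubled
        else:
            spectrum[k] = 0j    # negative frequencies are set to zero
    return dft(spectrum, inverse=True)


def zero_pad(signal, padded_length):
    """Append zeros until the signal has padded_length samples."""
    return list(signal) + [0j] * (padded_length - len(signal))


def range_step(bandwidth_hz, n_samples, n_padded):
    """Range-axis step size after zero padding, equation (2), in metres."""
    return C0 * n_samples / (2.0 * bandwidth_hz * n_padded)


if __name__ == "__main__":
    samples = [math.cos(2.0 * math.pi * 2 * m / 16) for m in range(16)]
    padded = zero_pad(analytic_signal(samples), 64)
    print(len(padded), range_step(30e9, 1024, 8192))
```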


2.2 Combined filter 


The combined filter w_TxRx(n) in (3) consists of several sub-filters for
different tasks. The calibration w_cal(n) removes the channel response from
the measurement signal, the Hamming filter H(n) and the Kaiser filter
K_TxRx suppress side lobes and noise in the imaging area, and the multi-
plicity value M_TxRx equalises the illumination level along the aperture
of the MIMO array. The effect of the different filters is shown in figure
2.


$$ w_{TxRx}(n) = w_{cal}(n)\,\frac{H(n)\,K_{TxRx}}{M_{TxRx}} \qquad (3) $$




J. Perske, H. Cetinkaya, C. Schwäbig, and S. Gütgemann 




Figure 2: Output data |I(x,y)| of the imaging algorithm with 970 simulated input signals
(B = 160 GHz − 120 GHz, N = 1024, N_p = 8192) for a single point reflector with
(1) no filters, (2) Hamming filter, (3) Hamming and Kaiser filter, (4) Hamming,
Kaiser and multiplicity filter.

The Hamming filter (4) is applied as a window function over the
whole signal length in range direction, reducing side lobes and noise
along the x-axis of the reconstructed image.


$$ H(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N - 1}\right) \qquad (4) $$

The Kaiser filter is not applied over the signal length but over the
y-axis of the antenna array. A virtual antenna position V_TxRx is calcu-
lated as the midpoint between the transmitter position P_Tx and the receiver
position P_Rx. The virtual y-position of each pair is then used to calcu-
late the Kaiser value K_TxRx in (5). Here, I_0 is the zeroth-order modified
Bessel function of the first kind.


$$ K_{TxRx} = \frac{I_0\left(\pi\alpha\sqrt{1 - \left(2\,\frac{V_{TxRx,y} - \min(V_y)}{\max(V_y) - \min(V_y)} - 1\right)^2}\right)}{I_0(\pi\alpha)} \quad \text{with} \quad \alpha = 4.0 \qquad (5) $$


The multiplicity value M_TxRx for each antenna pair is generated by
counting the number of overlapping virtual antenna positions within a
threshold radius around V_TxRx.
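The three sub-filters can be sketched as plain functions. The example below assumes the standard Hamming window with denominator N − 1, a Kaiser weight over the normalised virtual y-position with α = 4.0, and a multiplicity count that includes the antenna pair itself; the Bessel function is evaluated by its power series. Function names, the normalisation by the minimum and maximum virtual y-position, and the sample positions are assumptions for illustration.

```python
import math


def hamming(n, length):
    """Standard Hamming window value for sample n of a signal of given length."""
    return 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))


def bessel_i0(x, terms=30):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def kaiser(v_y, v_min, v_max, alpha=4.0):
    """Kaiser weight for a virtual y-position between v_min and v_max."""
    rel = 2.0 * (v_y - v_min) / (v_max - v_min) - 1.0  # mapped to [-1, 1]
    arg = math.pi * alpha * math.sqrt(max(0.0, 1.0 - rel * rel))
    return bessel_i0(arg) / bessel_i0(math.pi * alpha)


def multiplicity(index, virtual_positions, radius):
    """Number of virtual positions within radius of the selected one (itself included)."""
    cx, cy = virtual_positions[index]
    return sum(1 for (x, y) in virtual_positions
               if math.hypot(x - cx, y - cy) <= radius)


if __name__ == "__main__":
    positions = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0)]  # hypothetical, in mm
    weight = hamming(512, 1024) * kaiser(0.1, 0.0, 5.0) / multiplicity(1, positions, 0.5)
    print(weight)
```

Multiplying such a weight by the calibration factor w_cal(n) gives one sample of the combined filter.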


2.3 Image reconstruction 


The image reconstruction steps use a back-projection algorithm to map
the filtered input signals onto a two-dimensional plane. Owing to the
physical properties of the FMCW radar, peaks in the IF signal spectrum
correspond to the presence of a reflecting object in the sensor’s field of
view. Therefore, the first step is to perform the FFT of the IF signal. In
(6), the signal response of an antenna pair S_TxRx(x,y) is calculated for
any position in the target area with the distance d_TxRx(x,y) between
the antennas and the pixel position.


$$ S_{TxRx}(x,y) = \mathrm{FFT}\left[s^{p}_{TxRx}(m)\right]\big(d_{TxRx}(x,y)\big) \qquad (6) $$


Since the result of the FFT is discrete, the value of S_TxRx(x,y) cannot
be read off directly at an arbitrary distance. Therefore, it is necessary to
interpolate the result of the FFT at the target distance to get a continuous
function. The efficient implementation of this interpolation through GPU
texture memory is one of the main aspects of the optimisation process
described in this work.

Finally, the signal responses of all antenna pairs are superimposed to
get the combined reflection intensity for any target position. With the
application of a phase correction value Φ(d), the reflectivity function
I(x,y) can be represented as in (7). Here, f_0 refers to the starting fre-
quency of the FMCW radar chirp.


$$ I(x,y) = \sum_{TxRx} \Phi\left[d_{TxRx}(x,y)\right] S_{TxRx}(x,y) \quad \text{with} \quad \Phi(d) = e^{\,j 2\pi f_0 d / c_0} \qquad (7) $$


Figure 3: Activity diagram showing the steps of the velocity estimation algorithm.
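The back-projection described above can be sketched on the CPU as follows. The example takes the distance as the path length from transmitter to pixel to receiver, maps it linearly to a spectrum bin via the range step, interpolates the complex spectrum linearly (the role taken by the texture hardware in the GPU implementation), and applies a phase factor exp(j·2π·f_0·d/c_0). The sign of the phase, the distance-to-bin mapping, and all names are assumptions for illustration, not the interfaces of the described library.

```python
import cmath
import math

C0 = 299_792_458.0  # speed of light in m/s


def interpolate(spectrum, position):
    """Linearly interpolate a complex spectrum at a fractional bin position."""
    if position < 0 or position > len(spectrum) - 1:
        return 0j  # outside the spectrum: no contribution
    low = int(math.floor(position))
    high = min(low + 1, len(spectrum) - 1)
    frac = position - low
    return (1.0 - frac) * spectrum[low] + frac * spectrum[high]


def back_project(spectra, tx_positions, rx_positions, pixels, delta_d, f0):
    """Sum the phase-corrected responses of all antenna pairs for each pixel.

    spectra[i] is the range spectrum of the pair
    (tx_positions[i], rx_positions[i]); delta_d is the assumed range
    step per spectrum bin in metres.
    """
    image = []
    for (px, py) in pixels:
        value = 0j
        for spectrum, (tx, ty), (rx, ry) in zip(spectra, tx_positions,
                                                rx_positions):
            # Path length transmitter -> pixel -> receiver.
            d = math.hypot(px - tx, py - ty) + math.hypot(px - rx, py - ry)
            response = interpolate(spectrum, d / delta_d)
            phase = cmath.exp(2j * math.pi * f0 * d / C0)
            value += phase * response
        image.append(value)
    return image


if __name__ == "__main__":
    spectra = [[0j, 0j, 1 + 0j, 0j]]
    image = back_project(spectra, [(0.0, 0.0)], [(0.0, 0.0)],
                         [(1.0, 0.0), (3.0, 0.0)], delta_d=1.0, f0=0.0)
    print([abs(v) for v in image])
```

Each pixel is independent of all others, which is what makes the algorithm well suited to one GPU thread per pixel.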


3 Velocity estimation 


The velocity estimation is based on the imaging algorithm. With a
MIMO array arranged along the movement axis of the observed object,
the shift of the object is determined through the cross-correlation of
two consecutive measurements. The steps of the velocity estimation in
figure 3 involve the reduction of the image data to a one-dimensional
function (8), from which the shift of an object is determined through a
cross-correlation with the previous image data (9).


$$ I_{max}(y) = \max_{x \in D_x} |I(x,y)| \quad \text{with} \quad D_x = [x_{min}, x_{max}] \qquad (8) $$


The cross-correlation is calculated between each I_max,i and the previ-
ous I_max,i−1 to determine the shift of the observed object. Here, ⋆ is the
short notation for the cross-correlation.


$$ \Delta y = \operatorname*{argmax}_{y \in D_y} \left( I_{max,i}(y) \star I_{max,i-1}(y) \right) \quad \text{with} \quad D_y = [y_{min}, y_{max}] \qquad (9) $$


Finally, the velocity is calculated with the timestamps t_i and t_{i−1} of
the received data (10).
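The velocity estimation can be sketched as follows. The example reduces an image to its maximum magnitude per y-position, finds the lag that maximises the cross-correlation with the previous measurement, and converts the shift to a velocity. Since the expression for the velocity is not reproduced here, the sketch assumes the shift divided by the time difference; the pixel size used to convert pixels to metres and all function names are hypothetical.

```python
def column_maximum(image):
    """Reduce image[x][y] to the maximum magnitude over x for each y."""
    n_y = len(image[0])
    return [max(abs(row[y]) for row in image) for y in range(n_y)]


def cross_correlation_shift(current, previous):
    """Return the lag in pixels that maximises the cross-correlation."""
    n = len(current)
    best_lag, best_value = 0, float("-inf")
    for lag in range(-(n - 1), n):
        value = 0.0
        for y in range(n):
            if 0 <= y - lag < n:
                value += current[y] * previous[y - lag]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


def velocity(shift_pixels, pixel_size, t_current, t_previous):
    """Assumed relation: shift in metres divided by the time difference."""
    return shift_pixels * pixel_size / (t_current - t_previous)


if __name__ == "__main__":
    previous = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    current = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    shift = cross_correlation_shift(current, previous)
    print(shift, velocity(shift, 0.001, 0.10, 0.05))
```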


4 Implementation details


The approach of this project is not only to implement the new algo- 
rithm in an efficient way, but also to ensure the re-usability of the de- 
veloped code by creating the foundation for a general-purpose CUDA 
signal processing library. With that in mind, the focus during
development is on modularity, clean and safe code, and
proper documentation.

The software is written in C++17 and is completely object-oriented. 
The different parts involve an advanced memory management, data 
handling, error handling and a flexible structure for the implementa- 
tion of new arithmetic operations. Since CUDA code uses global defini- 
tions for the kernel and device functions and in some cases even needs 
global variables, all those relics from CUDA C are hidden behind proxy 
classes and never exposed to the user of the library. In order to make 
error handling systematic, robust, and non-repetitive [6, E.2], this library 
replaces the return-based error handling from the CUDA Runtime API 
with a throw-based error handling with dedicated exception classes. 


4.1 Memory organisation 


The CUDA Runtime API provides C-like cudaMalloc and cudaFree 
functions for memory allocation and deallocation. This technique is 
still supported but outdated in modern C++ [6, R.10] and therefore, 
those functions are wrapped in the class DeviceArray to perform an 
automatic allocation and deallocation in its Constructor and Destruc- 
tor. Avoiding manual memory allocation reduces the risk of leaks and 
simplifies the memory management. 

Another abstraction layer shown in figure 4 is the implementation 
of different DeviceData subclasses. Those subclasses hold additional 
information about the data dimension and provide methods to access 
subareas of the allocated memory without manual pointer manipula- 
tion. 

Figure 4: Simplified class diagram of the implemented memory management. Classes
derived from DeviceArray inherit the automatic device memory allocation.
Classes derived from SynchronizedArray inherit the automatic host memory
allocation and synchronisation methods.


In addition, the SynchronizedDeviceArray classes allocate host
memory and allow the bi-directional synchronisation of host and de-
vice memory. The cudaMallocHost function allocates page-locked host
memory, which enables a faster data transfer during the synchronisation.
Virtual inheritance is used to resolve the diamond pattern in this design.

All memory management classes are templates to be usable with 
different data types without code duplication. Those templates are 
specialised through explicit template specialisation, which generates 
the source code for a specific selection of data types during compile 
time. This technique is mainly used to restrict the usage of the template 
classes to only implemented and tested data types. This explicit form 
of generating source code from templates during compilation (template 
metaprogramming) is not recommended in the C++ Core Guidelines 
[6, T.120, T.121], except for the emulation of concepts. Although this
library would benefit from using concepts, the NVIDIA nvcc
compiler does not support this new C++20 feature yet.


4.2 Arithmetic operations 


All arithmetic functions and various memory operations are imple-
mented using a visitor pattern. The design shown in figure 5 imple-
ments the different operations as visitors in dedicated classes which
get called by the memory objects. This design avoids a hard binding
between the implemented operations and the data on which they are
applied and facilitates the implementation of new operations.

Figure 5: Simplified class diagram of the implemented visitor pattern for the arithmetic
operations. Any DeviceOperation can be applied on any OperationTarget.
The two operations FFT and Add are examples of concrete DeviceOperation
classes.
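The structure of such a visitor pattern can be sketched in a few lines. The example below is a Python stand-in for illustration only, since the library itself is written in CUDA C++ and its real interfaces are not shown here. The class names DeviceOperation, OperationTarget and Add follow the caption of figure 5; the visit method, the Scale operation and the list-based data are hypothetical.

```python
from abc import ABC, abstractmethod


class DeviceOperation(ABC):
    """Visitor: one subclass per operation."""

    @abstractmethod
    def visit(self, target):
        """Apply the operation to the data held by target."""


class OperationTarget:
    """Element: a data container that accepts arbitrary operations."""

    def __init__(self, values):
        self.values = list(values)

    def accept(self, operation):
        operation.visit(self)
        return self  # returning self allows operations to be chained


class Add(DeviceOperation):
    """Adds a constant to every element."""

    def __init__(self, summand):
        self.summand = summand

    def visit(self, target):
        target.values = [v + self.summand for v in target.values]


class Scale(DeviceOperation):
    """Multiplies every element by a constant (hypothetical operation)."""

    def __init__(self, factor):
        self.factor = factor

    def visit(self, target):
        target.values = [v * self.factor for v in target.values]


if __name__ == "__main__":
    data = OperationTarget([1.0, 2.0, 3.0])
    data.accept(Add(1.0)).accept(Scale(2.0))
    print(data.values)
```

A new operation only requires a new DeviceOperation subclass; the data classes remain untouched, which is the decoupling described above.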

An alternative to this design would be a Utility Class implemented 
as Singleton or with the use of static methods. There is a wide dis- 
cussion about the usage of Utility Classes, and in general they are not 
seen as good practice. They break the principles of object-oriented pro- 
gramming by having only one instance of the class, which comes with 
several downsides compared to a regular instantiable class. Moreover,
the tight coupling to the Utility Class prevents replacing this depen-
dency with a subclass that extends its functionality.


4.3 Optimisations 


The final implementation of the image processing algorithm is the
result of a longer optimisation process, and a selection of the intermediate
stages of this process is presented and compared in this section.
These stages of the development process are described to explain the
different design decisions and their effect on the overall performance.
For comparison, all described tests are performed on the same
hardware with 970 simulated input signals (N = 1024, N_p = 8192) and a
target area size of 512 × 512 pixels. The execution time is measured with
the analysis software Nvidia Nsight Systems, and the GPU activity is
observed with Nvidia Nsight Compute.

Figure 6: Nvidia Nsight Systems timeline for the unoptimised algorithm. Many small
kernels are launched and executed sequentially.

Figure 7: Nvidia Nsight Systems timeline for the second optimisation stage. A
specialised image processing kernel per data block massively reduces the
overhead caused by many small kernel launches.

The first attempt to implement the image processing does not involve
any specialised kernel functions but only uses general arithmetic oper-
ations. This stage exclusively aims to check the general functionality
and to establish a baseline for the performance improvements. All
operations in figure 6 are executed sequentially with a total execution
time of 110.0 ms.

As a first optimisation step, the input data is divided into different
blocks to enable the parallel execution of multiple kernel functions on
the GPU. Each processing block generates partial image data, which is
combined into the final image when all processing blocks are finished.
The assignment of a dedicated CPU thread and CUDA stream to each
block reduces the overall execution time to 94.3 ms.

The most obvious drawback of the previous implementation is the
large number of small kernel launches, which comes with a large over-
head compared to a single specialised kernel. The calculation of the
antenna distance, the interpolation of the FFT data and the applica-
tion of the complex phase correction consist of five kernel launches
for each signal. The idea is now to combine those steps in one kernel
launch per block and to loop over the signals inside the kernel func-
tion. The result in figure 7 is a reduction of the overall execution time
to 14.3 ms.

The most significant proportion of the execution time is still caused 
by the specialised image processing kernel, and reducing its execution 
time will have a large impact on the overall execution time. Further op- 
timisations and the analysis of a single kernel need a deeper look into 
the GPU hardware. The kernel analysis tool Nvidia Nsight Compute 
measures various metrics of a kernel on the hardware layer. That in- 
cludes bandwidth measurements, the utilisation of the different mem- 
ory types, cache and hit-rate analysis and the utilisation of the different 
GPU pipelines. 

The analysis of the image processing kernel reveals a very low L2
cache hit rate of 5.66% and a very high utilisation of the GPU integer
multiplication and floating point operation pipeline (FMA). Both prob-
lems are targeted by transferring the FFT data into a batched read-only
1D texture in which each row is filled with the complex spectrum of
a single input signal. Thus, the GPU is able to perform more aggres-
sive caching by ignoring possible write operations and predicting the
memory access for adjacent rows. Another major advantage of the tex-
ture memory is the ability to perform hardware interpolation during
memory access, which relieves the FMA unit. The analysis of the opti-
mised kernel shows an L1 and L2 cache hit rate of over 95%, a balanced
utilisation of the used GPU pipelines and an overall execution time of
6.1 ms.
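The effect of the individual stages is easier to compare as speedup factors relative to the first implementation. The following short calculation uses only the execution times reported in this section; the stage labels are paraphrased and the function name is hypothetical.

```python
# Execution times in milliseconds as reported for the four stages.
STAGES = [
    ("general arithmetic operations", 110.0),
    ("blocks with CPU threads and CUDA streams", 94.3),
    ("one specialised kernel per block", 14.3),
    ("FFT data in texture memory", 6.1),
]


def speedups(stages):
    """Return (name, time in ms, speedup relative to the first stage)."""
    baseline = stages[0][1]
    return [(name, time_ms, baseline / time_ms) for name, time_ms in stages]


if __name__ == "__main__":
    for name, time_ms, factor in speedups(STAGES):
        print(f"{name}: {time_ms:.1f} ms, speedup {factor:.2f}")
```

The largest single gain comes from merging the small kernels (about a factor of 7.7 relative to the baseline), and the texture-based kernel brings the total to roughly a factor of 18.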


5 Conclusion


The new imaging radar algorithm was successfully implemented in
CUDA C++ and was used to perform initial simulations and to
estimate the expected signal processing time of the sensor system. Besides
general optimisations, like the usage of streams and the step-by-step
kernel runtime optimisation, the main outcome is the suitability of the
GPU texture memory for the back-projection algorithm. A first sensor
prototype is in the making and will be used for further tests and to
verify the results of this work. The code developed during this work
is now the foundation for a general-purpose CUDA signal processing
library, which will be used and extended in further projects.

Abstract In this paper, we present a tool that combines active
learning and deep learning for the segmentation of 3D CT
data. We demonstrate the results of the method on the use
case of plant segmentation. In addition, we compare the method
with a baseline and a classical image-processing-based
algorithm.


Keywords Deep learning, active learning, semantic segmenta- 
tion, plant segmentation, image processing, u-net 


1 Introduction 


Automated segmentation of 3D CT data has a vast field of application.
Especially in the medical environment, there is currently a transition
from conventional methods based on classical image processing to
machine learning / deep learning (ML/DL) based methods [1,2]. Much
of the success of deep learning is due to the large
number of publicly available annotated datasets, for example, the
ImageNet database [3]. One of the major challenges is the need to
acquire sufficient ground truth data for modeling. However, such data
is usually not available in sufficient quantities, especially for
industrial use cases. Moreover, the annotation of this data turns out to be
an extremely time-consuming and very expensive task, especially for
large 3D datasets.

Thus, we need effective methods to reduce the labeling effort. One
such method is active learning, a collection of techniques that help
machine learning algorithms achieve better results with less labeled
training data. The learning algorithm can interactively prompt a user
to assign the correct labels to new data points. To do this, the algorithm
should ask questions that promise a high information gain in order to
keep the number of questions as small as possible.

These questions, called queries, can be grouped into three main
types: stream-based selective sampling, membership query synthesis,
and pool-based sampling. Stream-based selective sampling assumes a
stream of incoming unlabeled data points x. The current model and a
measure of informativeness I(x) are used to decide for each
incoming data point whether to ask the oracle for an annotation. In
membership query synthesis, the data points are not drawn from a dataset;
instead, the model generates new data points that it considers
informative. With pool-based sampling, a batch b is selected from
the unlabeled dataset. The current model is used to predict the samples
of the batch and to obtain a measure of informativeness I(b). Based on
this measure, the best N samples are selected to be annotated by the oracle [4].
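Pool-based sampling can be sketched as a ranking of the pool by an informativeness measure. The example below uses the entropy of the predicted class probabilities, which is one common choice; the text above does not prescribe a specific measure, so the measure, the function names and the sample predictions are assumptions for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def select_queries(pool_predictions, n_queries):
    """Return the ids of the n_queries most uncertain samples.

    pool_predictions maps a sample id to the class probabilities
    predicted by the current model.
    """
    ranked = sorted(pool_predictions,
                    key=lambda sample_id: entropy(pool_predictions[sample_id]),
                    reverse=True)
    return ranked[:n_queries]


if __name__ == "__main__":
    # Hypothetical predictions for the classes plant, paper, background.
    pool = {
        "a": [0.98, 0.01, 0.01],
        "b": [0.34, 0.33, 0.33],
        "c": [0.60, 0.30, 0.10],
    }
    print(select_queries(pool, 2))
```

The selected samples would then be passed to the oracle for annotation and added to the training set before the next round.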

Overall, deep learning has strong capabilities in processing data
through automatic feature extraction, but requires a very large amount
of annotated data to do so. Active learning, on the other hand, has
the potential to effectively reduce the effort required for labeling.
Deep learning and active learning thus complement each other,
and combining them significantly improves their application potential.
Therefore, we have developed a tool that allows us to apply active
learning to deep-learning-based segmentation of 3D CT data.

We demonstrate the use of our tool on the example of plant
segmentation, as plant breeding has undergone rapid progress in recent
decades. Targeted plant breeding, for example of
climate-resistant strains, is becoming increasingly important [5].
Innovative analysis methods, such as 3D segmentation, play an
essential role here, enabling seedlings and seeds to be assessed
qualitatively.

The segmentation task here is to divide the 3D CT scan of the plant 
inside a container in folded paper into the classes plant, paper and 
background (see figure 1). Through use of the segmentation the indi- 
vidual plants can be evaluated and classified by downstream applica- 
tions later. It is particularly difficult to distinguish the seedlings from 
the paper. Paper and seedling absorb X-rays to a similar degree, so 
there is virtually no usable contrast difference that could be used for 
Figure 1: Data for the plant segmentation. The data is split into the raw image or
volume and its corresponding segmentation. On the left, the volume is sliced
in axial direction. In the middle, a 3D rendering can be seen. On the right,
the sagittal/coronal direction is shown.


segmentation. The limited resolution of only 140 µm and image noise
make this even harder, which is why incorrect segmentations can easily
occur: either components of the seedling are assigned to the paper or
vice versa. This hinders the subsequent assessment of the seedling in the
downstream application due to incorrectly calculated characteristics.


2 Methods 


Our method operates in three main phases (see figure 2). In the pre- 
training phase, an initial network (currently 3D U-Net) is trained from 
weak labels. These can be derived from existing classical image pro- 
cessing pipelines, simulations or rough hand-annotations. 

Subsequently, this pre-trained network is passed to the active learn- 
ing phase. The active learning phase itself also consists of several 
steps, namely inference, location, visualization/interaction, and train- 
ing. During inference, the segmentation network generates a segmen- 
tation map, which is then analyzed during the location phase. Then the 
user can visualize the results and interact with them to correct invalid 
segmentations. Next, the areas corrected by the user are retrained dur- 
ing the training phase and the weights of the segmentation network are 
updated. A graphical user interface guides the user through these four 
steps until a visually satisfactory result is achieved or an application- 
specific condition is met. 
Figure 2: Overall conceptual process of the developed deep learning and active learning
approach. 


Finally, the resulting fine-tuned network can be deployed. As an 
additional result, all corrections made by the user during the active 
learning phase can be used for future algorithms or training. 


2.1 Pre-training phase and network architecture 


In the pre-training phase, the network is initially trained in such a way 
that later, in the active learning phase, the segmentation is almost cor- 
rect and only invalid segmentations have to be corrected and re-trained. 
For this, already existing classical algorithms (based on thresholding, 
filtering, ...) or simulations can be used as weak labels. 

The network architecture is a simple 3D U-Net (see figure 3). It is
5 levels deep with 2 convolution blocks per level. With each level, the
number of feature maps doubles and the spatial resolution halves. Each
convolution block consists of 2 convolution layers with batch
normalization [6], swish activation [7] and a residual connection. In
the decoding path, the feature maps are enlarged by simple upsampling
and then concatenated. The last layer is a 1x1x1 convolution with
softmax activation and yields the final segmentation. The entire 3D
input volume is usually too large to be processed at once, so it is
processed block by block in subvolumes of size 64³.

The training parameters are set as follows. As loss function we 
Figure 3: Schematic of the employed U-Net architecture.


choose the sum of Dice and cross-entropy loss, as in [8]. For the
optimizer, we use Adam [9] with a learning rate of 3e-4 and cosine decay.
As regularization, we set a weight decay of 2e~> [10] and also use la- 
bel smoothing of 0.1. Additionally, we use augmentations to increase 
the training data. We use contrast, noise, affine transformations, flips, 
blur and artifacts augmentations with varying strength. The network 
is implemented using TensorFlow [11] and the augmentation pipeline 
makes use of the TorchIO package [12]. The training was conducted
using an NVIDIA GTX Titan X with 12 GB of GPU RAM.
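To make the loss definition concrete, the following plain-Python sketch computes the sum of a soft Dice loss and a cross-entropy loss for a handful of voxels. It is not the TensorFlow implementation used for training. Applying the label smoothing of 0.1 only to the cross-entropy term, the constant `eps` and all function names are assumptions made for this illustration.

```python
import math

def smooth_labels(one_hot, alpha=0.1):
    """Label smoothing: move a share alpha of the mass to a uniform distribution."""
    k = len(one_hot)
    return [y * (1.0 - alpha) + alpha / k for y in one_hot]

def cross_entropy(targets, probs):
    """Mean cross entropy over voxels; targets and probs are lists of class vectors."""
    total = 0.0
    for t, p in zip(targets, probs):
        total -= sum(ti * math.log(max(pi, 1e-12)) for ti, pi in zip(t, p))
    return total / len(targets)

def soft_dice_loss(targets, probs, eps=1e-6):
    """One minus the soft Dice coefficient averaged over all classes."""
    n_classes = len(targets[0])
    dice = 0.0
    for c in range(n_classes):
        inter = sum(t[c] * p[c] for t, p in zip(targets, probs))
        denom = sum(t[c] for t in targets) + sum(p[c] for p in probs)
        dice += (2.0 * inter + eps) / (denom + eps)
    return 1.0 - dice / n_classes

def dice_ce_loss(targets, probs, alpha=0.1):
    """Sum of Dice loss and cross entropy (smoothing on the CE term is an assumption)."""
    smoothed = [smooth_labels(t, alpha) for t in targets]
    return soft_dice_loss(targets, probs) + cross_entropy(smoothed, probs)

# Three voxels, one per class (background, paper, plant).
targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
good = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
loss_good = dice_ce_loss(targets, good)
loss_uniform = dice_ce_loss(targets, uniform)
```

A confident, correct prediction yields a clearly lower loss than a uniform one, which is the behaviour the optimizer exploits.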


2.2 Active learning phase 


After the network has completed the pre-training phase and has
reached suitable convergence, it is passed on to the active learning
phase. Here the focus is on the user, who is first presented with the
view shown in figure 4 within the simple application we developed for
interacting with the current dataset. It comprises three orthogonal,
sliceable views for navigating the dataset and a tool area that offers
multiple tools. The user has access to a brush, image processing
operations (flood-fill, morphology, clustering, ...), 3D visualization,
neural network training and use-case-specific functions.

In general, 4 sub-steps are then performed within the active learn- 
ing phase: namely, inference, location, visualization/interaction, and 
training. 

Inference. Here the network prediction is executed. In the field of 3D 
CT data segmentation, the volumes are often large, with sizes of several 
GB or more, which prevents the direct use of a segmentation network 


Figure 4: Overview of the graphical user interface that guides the user through the
active learning phase. The top right, top left and lower left show orthogonal
sliceable views that allow the user to navigate through the data and the
overlaid segmentation. The bottom right shows the tool area.


due to the limited GPU RAM. Therefore, to segment such volumes, we
need to split them into smaller blocks (usually 64³). Each of these
blocks is then segmented individually by the network. In addition,
neighboring blocks overlap at their edges to compensate for the lack of
spatial context there. Finally, all the blocks are merged to form the
total volume.
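The block-wise splitting can be sketched as below. The overlap of 8 voxels and the rule of aligning the last block with the volume border are assumptions for illustration; the text only states that blocks of size 64 with overlapping edges are used.

```python
def block_starts(length, block=64, overlap=8):
    """Start indices of overlapping blocks that cover one axis of the given length."""
    if length <= block:
        return [0]
    stride = block - overlap
    starts = list(range(0, length - block, stride))
    starts.append(length - block)  # align the last block with the volume border
    return starts

def tile_volume(shape, block=64, overlap=8):
    """All (z, y, x) block origins for a volume of the given shape."""
    zs, ys, xs = (block_starts(n, block, overlap) for n in shape)
    return [(z, y, x) for z in zs for y in ys for x in xs]

# Example: a 200 x 64 x 120 voxel volume.
origins = tile_volume((200, 64, 120))
```

Each origin marks one subvolume that is passed to the network; when merging, the overlapping border voxels of neighboring blocks can be cropped or averaged.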

Location. In the localization phase, the user has to find and cor- 
rect incorrect segmentations. Since this is a very time-consuming pro- 
cess to do manually, we have developed a way to quickly and semi- 
automatically present potentially incorrect areas to the user. To do 
this, we use a random forest that classifies the objects contained in the 
current segmentation. It is trained by the user on the basis of a few 
examples. For this purpose, first the current segmentation is analyzed 
by a connected component analysis (CCA). Then, features are calcu- 
lated for each of the connected objects (e.g. size, mean, eigenvalues, 
...). Now the user has to label at least one object of each desired class 
(for example: paper, seedling, faulty, ...). After that, the random forest 
can be trained and applied to all contained objects. The user is then 
shown the objects that have been classified and can then improve their 
segmentation. The GUI and the random forest pipeline are shown in 
figure 5. 

Visualization/Interaction. After a wrong segmentation has been 
Figure 5: Location phase overview. On the top the GUI part of the location phase can be
seen. In the middle the table containing features of the objects can be selected. 
On the left you can see a cutout of the data and the object is slightly highlighted 
in red. On the right a 3D visualization of the object is displayed. The lower 
part shows the pipeline running in the background. The gray text describes the 
data flow from plain voxels to connected objects and their features. 


found in the localization phase, the user has to correct it. This is done 
with the help of the three orthogonal views in the GUI and the avail- 
able tools. Most of the time, the corrections that need to be made are 
small local corrections, such as roots that are incorrectly marked as pa- 
per. However, painting pixels is difficult and painting voxels turns out 
to be even more difficult. That’s why we provided the brush tool with 
special modes for segmenting plants. After all, the brush tool is the 
most commonly used tool for local segmentation changes. It should be 
easy (and fun) to use and support many automatic modes so that the 
user can segment as many voxels as possible by hand with as little
effort as possible. Figure 6 shows the use of the brush tool along with
its special Frangi filter [13] mode.

Training. After the localization, visualization and interaction phases 
mentioned above, the training phase can begin. The goal of our active 
learning process is that the user annotates as little as possible, but as 
much as necessary to correct the wrong segmentations. Therefore, the 
changes in the loss of the network are also rather small, which could 
hinder the learning of the corrected regions. To compensate for this 


Figure 6: Example of an incorrect segmentation and its correction. The Frangi filter can 
select tubular structures, which makes it easier to separate them from the pla- 
nar paper, allowing the user to correct the incorrect segmentation more easily 
than by manually tracing each voxel. 


imbalance, voxel-wise loss weighting is used to force learning of the
regions corrected by the user. The weight calculation is similar to
scikit-learn's class weight function [14]. The training parameters are
the same as in the pre-training phase mentioned above.
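As an illustration of such a weighting, the sketch below applies the "balanced" heuristic known from scikit-learn, n_samples / (n_classes * count), to a mask of user-corrected voxels. Treating "corrected" and "not corrected" as the two classes is an assumption; the exact weighting used in the tool is not specified here, and the function names are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights n_samples / (n_classes * count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

def voxel_weights(corrected_mask):
    """Per-voxel loss weights; the mask holds 1 for user-corrected voxels, else 0."""
    w = balanced_class_weights(corrected_mask)
    return [w[m] for m in corrected_mask]

# Example: only 2 of 100 voxels were corrected by the user.
mask = [0] * 98 + [1] * 2
weights = voxel_weights(mask)
```

In this example each corrected voxel receives a weight of 25, while each untouched voxel receives roughly 0.51, so both groups contribute equally to the total loss.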

Iteration. Finally, figure 7 shows an iteration of the active learning
process of the developed tool. Starting in the inference phase, the
current network generates a segmentation. Then, in the localization
phase, the incorrect region is found and presented to the user.
Subsequently, the user corrects the incorrectly segmented voxels. After
finishing training with the new annotations, the next iteration can
start. In the upper right of figure 7, the result after the iteration
can be seen on another area that was not annotated by the user.


2.3 Deployment phase 


After the active learning phase has been completed, the resulting fine- 
tuned network can be passed on to the deployment phase. Here, it 
is then used for inference in another application. In the case of plant 
segmentation, the output of the network is used to analyze individual 
seedlings and their characteristics for subsequent seed selection and 
breeding. 
Figure 7: Overview of the usual active learning workflow with example segmentations
in the different steps. In the top right the result after the iteration is shown. 


3 Results 


We evaluated our developed tool on the use case of segmentation of 3D 
CT scans of plants. The seedlings grow in a plastic container in folded 
paper. Due to the similar attenuation coefficients, it is particularly diffi- 
cult to distinguish plant and paper. We compare the performance of the 
pre-trained network and the fine-tuned network with the performance 
of a classical image processing-based algorithm [5]. The methods are 
compared by visual inspection and by calculating segmentation metrics.
To give no algorithm an advantage, we manually created a ground truth
for a scan from the test set from scratch, without algorithmic
assistance. To keep the effort manageable, we annotated only two evenly
distributed slices from each of the three orthogonal directions (see
figure 1). Each of these six slices took the annotator an average of 20
minutes. Extrapolated to the total scan size of about 800³, this would
require about 16 days for the entire scan in the worst case, which
would be impractical.

Figure 8 shows the comparison of the segmentations with the
respective ground truth slice from two different directions. As can
Figure 8: Results for the ground truth slices of different directions (top: axial, bottom:
sagittal/coronal). The method names and their dice results are shown above 
the image. Various points of interest are highlighted by yellow ellipses. 


be seen, the classical algorithm generally segments the roots more 
sparsely than the ground truth. In some cases, the roots are com- 
pletely missed, which is a fatal error for the final application in the 
deployment phase. The pre-trained network reproduces the errors of 
the classical algorithm, which is to be expected after it has been trained 
with data from the classical algorithm. The fine-tuned network finds 
roots missed by the other two methods, but segments them a bit too 
thick. Nevertheless, such an error is not as serious as missing roots. 

Figure 9 shows the metrics of the different methods. All metrics are
quite close to each other. The classical algorithm achieves the highest
score in only 1 of the 12 cases and the pre-trained network in 2. In the
remaining 9 cases, the fine-tuned network achieves the highest score.
This agrees with the assessment from the visual inspection.
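For reference, two of the reported metrics, Dice and Intersection over Union, can be computed per class as in the following sketch. It operates on flat label lists and is only meant to show the definitions; the function name and the toy labels are hypothetical and not taken from the evaluation.

```python
def dice_and_iou(pred, truth, cls):
    """Dice and IoU of one class for two equally long flat label lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, truth) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, truth) if p != cls and t == cls)
    if tp + fp + fn == 0:
        return 1.0, 1.0  # class absent in both: count as perfect agreement
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou

# Toy labels: 0 = background, 1 = paper, 2 = plant.
truth = [0, 0, 1, 1, 2, 2, 2, 0]
pred = [0, 0, 1, 2, 2, 2, 1, 0]
dice, iou = dice_and_iou(pred, truth, cls=2)
```

For a single mask pair, both metrics are linked by Dice = 2 IoU / (1 + IoU), so they rank methods identically on one slice and can only differ in ranking after averaging over several slices.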


4 Conclusion 


Overall, the results achieved with our active learning tooling in plant
segmentation are very promising. Although all metrics are quite close
to each other, we achieve a performance gain of about 1% in terms of the
Dice score. Furthermore, in the qualitative visual inspection, the DL segmentation
[Table: metrics Accuracy, Area under curve, Dice and Intersection over Union (IoU),
each reported for the classes Background, Paper and Plant; rows 1_classical,
2_pretrained and 3_finetuned.]


Figure 9: Table of the calculated segmentation metrics. The top row gives the metric,
the second row the class to which it refers. The last three rows show the
results of the individual algorithms. The metric of the best method is
highlighted in green.


results are ahead. Additionally, we did not use any prior knowledge 
about scan geometry, container, paper and plant type. This makes the 
DL approach much easier to adapt. In the future, other active learning 
approaches or new network architectures can be integrated to further 
increase the performance. 
Abstract This work presents a signal processing pipeline for an
autonomous race car in the context of Formula Student. The soft- 
ware used for each step from the detection of objects in camera 
images or lidar point clouds to the calculation of control out- 
puts for the actuators is described in detail. The sensors and 
actuators are covered and the system output is visualized. The 
computation times of the pipeline are analyzed, showing that the
complex algorithms used for motion planning and SLAM take up
most of the computation time and thus leave the most room for
improvement.


Keywords Autonomous driving, signal processing, Formula 
Student, YOLO, object detection, SLAM, MPC 


1 Motivation 


The future lies in autonomous driving, at least in Formula Student
(FS), an international design competition between student teams. In
this work, the signal processing pipeline of the 2022 electric and
autonomous race car of the team CURE (Cooperative University Racecar
Engineering) is presented. While Formula Student poses a rather narrow
challenge for autonomous vehicles due to a controlled environment and
clearly specified tracks and track boundaries, it is a good development
and testing ground for algorithms which are also used in agricultural,
industrial or real-life traffic situations.
2 Problem description


During the FS events, the autonomous race car competes in four dif- 
ferent types of competitions, all posing different challenges to the car 
and the Autonomous System (AS): Acceleration (1), Skidpad (2), Au- 
tocross (3) and Trackdrive (4). The disciplines test the car’s ability to 
(i) drive straight lines (1), (ii) handle high acceleration and deceleration 
forces (1), (iii) withstand high lateral forces (2), (iv) choose the correct 
direction at a known intersection (2), (v) navigate unknown tracks (3) 
and (vi) reliably generate global maps and locate itself in them (3, 4). 
During all events, the track boundaries are marked by cones of known 
sizes [1, Tab. 3]. Small blue and yellow cones mark the left and right 
sides and orange cones signal finish lines and the exit areas in which 
the car has to come to a standstill. The challenge the cars face is to de- 
tect the cones correctly, align the detections with previous knowledge 
about the tracks - either from the competition rules or from internally 
built maps - generate a path to follow and send control signals to the 
actuators accordingly. 


3 System overview 


This section gives a brief overview of the hardware and software used 
to run the pipeline. In it, the processing unit, the sensors and the 
actuators of the race car are described. 

To handle the challenges regarding the computational power and the 
needs of the image processing software, a custom-built Autonomous 
Compute Unit (ACU) consisting of an AMD Ryzen 5 5600G hexa-core
CPU, an NVIDIA Tesla T4 data center GPU and 32 GB of memory is used.
It runs Ubuntu 20.04 LTS. To implement the various functionalities
of the AS, the Robot Operating System (ROS) Noetic is used.
This provides the means for inter-process communication, threading, 
debugging as well as visualization tools. In order to simplify develop- 
ment, deployment and maintenance, the complete AS is containerized 
using Docker. 

To interact with the rest of the electrical system in the race car,
multiple CAN buses are used. A CAN-to-ROS interface is used to send and
receive messages. The sensors connected to the CAN bus include steering
wheel angle sensors, wheel speed sensors and an IMU (Inertial
Measurement Unit). All of them are used as inputs to the AS in order
to track the car’s position and generate control outputs accordingly.
These are then used to control the actuators, which include the motors,
the motor for the steering actuation and the electrical valves for the
brake system. In addition to the sensors connected via CAN, other
sensors are connected directly to the ACU via either USB or Ethernet.
These include a dual-antenna GPS for position and heading information,
a stereo camera and a lidar.


4 Signal processing pipeline 
Figure 1: Overview of the modules of the signal processing pipeline.


This section gives a detailed description of the signal processing 
pipeline as a whole and each module in it as shown in Figure 1. 


4.1 Camera perception 


This section focuses on the generation of local maps from images taken 
with a Stereolabs ZED2i stereo camera. 


Camera Calibration Since the camera images are currently used as the 
main way to determine the positions of the cones, the camera needs to 
be calibrated as precisely as possible. The local mapping process is 
closely related to this as it requires both an intrinsic and an extrinsic 
camera matrix to describe the transformation of the cone positions from 
image to world coordinates. With the currently used camera model, 
the intrinsic matrix is supplied by the manufacturer and not subject to
change. Currently, the extrinsic calibration is done by mapping image
points to world points. The world points are given by a 3 x 3 marker
pattern whose position we obtain by measuring its distance to the
camera. After capturing an image, the image coordinates of the markers
are collected by picking out the respective pixels. Using the image
points, the intrinsic matrix and the world coordinates of the points,
the extrinsic matrix can be computed using OpenCV’s function
solvePnP() [2]. This method has the benefit of requiring only one image,
but measuring the world coordinates by hand and picking the image
points manually are both error-prone and degrade the calibration as a
whole. Replacing the manual steps with automated library functions
would considerably improve the accuracy of the resulting calibration.


Inference An integral part of the camera-based perception is the de- 
tection of differently colored cones in the images the camera provides. 
As these detections are used to calculate the position of the cones rela- 
tive to the vehicle, the task of inference needs to be done both quickly 
and accurately. 

In order to reach this goal, a neural-network-based approach for object
detection was chosen. The core element is a YOLOv5 convolutional neural
network [3], which is based entirely on PyTorch and is therefore easier
to work with. YOLO networks have gained a lot of popularity in recent
years as they achieve accuracy similar to, if not better than, that of
Single-Shot Detectors while being significantly faster [4]. Using the
repository code, a network is trained on images that were captured and
labeled by ourselves as well as on additional training data from the
Formula Student Objects in Context (FSOCO) repository [5]. To further
improve the process, pre-trained weights are used, which reduces the
need for a big data set and, consequently, also the time needed for
training.

The actual logic for the task of detecting cones is based on an
open-source inference implementation of YOLOv5 that leverages the
capabilities of NVIDIA’s TensorRT library to further optimize
performance [6]. Using this camera-based perception pipeline, the
vehicle is able to detect cones at a distance of up to 15 meters on
images with a resolution of just 1280 x 720 pixels while achieving
inference times of around 50 ms on average. While the detections proved
reliable under varying weather conditions, the neural network struggles
to detect cones under strong backlight, as the camera then automatically
lowers the image brightness. Additionally, because of the Python
implementation, processing the 1920 x 1080 pixel images provided by the
camera is not possible without significantly sacrificing inference
speed. Consequently, migrating the code to C++ would be beneficial in
the future.


Local Mapping In order to generate a map with reference to the car’s
current position, the cone positions have to be transformed from image to
world coordinates. The intrinsic and extrinsic matrices are used to
project the top middle point of the bounding box around each detected
cone from the image into the world. This projection results in a ray, as
the distance cannot be calculated from the pixel coordinates alone. To
obtain the actual position, the ray is intersected with a plane at the
known height of the cones. This is done for all bounding boxes in the
image, resulting in a list of local cone positions to pass on to the
rest of the pipeline.
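The ray-plane intersection can be illustrated with a short sketch. It assumes an ideal pinhole camera without lens distortion, expresses the extrinsics as a camera-to-world rotation plus the camera position, and uses made-up values for the intrinsics, the mounting height and the cone height; none of these numbers are taken from the race car.

```python
def pixel_to_world(u, v, intrinsics, rot_cam_to_world, cam_pos, plane_z):
    """Intersect the viewing ray of pixel (u, v) with the plane z = plane_z."""
    fx, fy, cx, cy = intrinsics
    # Ray direction in the camera frame (x right, y down, z forward).
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction into the world frame (x forward, y left, z up).
    d_world = [sum(r * d for r, d in zip(row, d_cam)) for row in rot_cam_to_world]
    if abs(d_world[2]) < 1e-9:
        return None  # ray is parallel to the plane
    s = (plane_z - cam_pos[2]) / d_world[2]
    if s <= 0:
        return None  # intersection lies behind the camera
    return tuple(c + s * d for c, d in zip(cam_pos, d_world))

# Assumed setup: camera 1.0 m above ground, looking straight ahead.
INTRINSICS = (700.0, 700.0, 640.0, 360.0)      # fx, fy, cx, cy (hypothetical)
ROT = [[0, 0, 1], [-1, 0, 0], [0, -1, 0]]      # camera axes to world axes
CAM_POS = (0.0, 0.0, 1.0)
CONE_HEIGHT = 0.325                            # m, assumed cone tip height

cone_xyz = pixel_to_world(640.0, 427.5, INTRINSICS, ROT, CAM_POS, CONE_HEIGHT)
```

In this setup the pixel at the top middle of the bounding box maps to a cone tip 7 m straight ahead of the camera. Any error in the rotation directly shifts the intersection point, which is why the result depends so strongly on the calibration.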

While the calculation of the local maps itself has proven reliable dur- 
ing testing and competitions, its accuracy is highly dependent on the 
accuracy of the camera calibration, so an improved calibration process 
as mentioned above could significantly improve the quality of the local 
maps. 


4.2 Lidar perception 


To increase the robustness of the system as a whole, a Velodyne VLP-16 
Puck Hi-Res lidar is used to generate local maps of the environment as 
well. Due to time constraints, the lidar perception module was not
used during this year’s competitions. However, it has been developed
and tested with a test data set.

First, the amount of data in the captured point cloud is reduced 
significantly by cropping the field of view in order to increase the com- 
putational performance. Second, the ground plane is filtered out using 
the Himmelsbach algorithm [7]. Once the point cloud only contains 
points which are not in the ground plane, Euclidean clustering is used
to group the points. Then, the shapes of these clusters are checked to
keep any cone-shaped clusters and remove erroneous detections like
people, walls and other structures. The coordinates of the detected
cones are then passed on to the SLAM and track filtering modules.
Additionally, a 3D-to-image projection is used to add color information
to the detected cones using the camera images. While the approach works
well, the performance of the pipeline is limited by the low number of
channels of the Puck Hi-Res. Cones which are 6 m away from the lidar
already consist of fewer than 10 points, and the number of points
decreases further with increasing distance.
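The clustering and shape check can be sketched in a few lines. This is a naive O(n^2) illustration on ground-free points, not the implementation running on the car; the search radius, the minimum cluster size and the bounding-box limits in `looks_like_cone` are assumed values.

```python
import math
from collections import deque

def euclidean_clusters(points, radius=0.3, min_points=3):
    """Group points that are chained together by neighbours within radius."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, cluster = deque([seed]), [seed]
        while queue:
            i = queue.popleft()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
                queue.append(j)
                cluster.append(j)
        if len(cluster) >= min_points:
            clusters.append(sorted(cluster))
    return clusters

def looks_like_cone(cluster_points, max_width=0.3, max_height=0.4):
    """Crude shape check: the bounding box must be small in all directions."""
    xs, ys, zs = zip(*cluster_points)
    return (max(xs) - min(xs) <= max_width
            and max(ys) - min(ys) <= max_width
            and max(zs) - min(zs) <= max_height)

# Toy point cloud: one cone-sized cluster and one wall segment.
cone = [(5.0, 1.0, 0.05), (5.05, 1.0, 0.15), (5.0, 1.05, 0.25)]
wall = [(8.0, 0.2 * i, 0.5) for i in range(11)]
points = cone + wall
clusters = euclidean_clusters(points)
cones = [c for c in clusters if looks_like_cone([points[i] for i in c])]
```

The wall forms a single long cluster that fails the size check, while the three cone points pass it. With fewer than 10 points per cone at 6 m, a `min_points` threshold has to be chosen carefully to avoid discarding distant cones.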


4.3 Simultaneous localization and mapping 


The main goal of Simultaneous Localization and Mapping (SLAM) is to
enable motion planning to generate global trajectories and thus
increase the vehicle performance. The SLAM algorithm is implemented
as an Unscented Kalman Filter (UKF) in Python. This type of filter was
chosen as it handles highly non-linear problems like polar cone
positions better than an Extended Kalman Filter (EKF). It is also
computationally cheaper than Particle Filters or Graph-based SLAM
approaches, which have a higher complexity. The underlying architecture
and mathematics are based on the open-source library FilterPy [8], but
were adapted to increase speed and compatibility with our system. The
tracked state vector X of the UKF consists of the tracked landmarks
(x_1, y_1) to (x_n, y_n) and the vehicle pose, which contains the
vehicle position x, y, the longitudinal vehicle velocity v_x and the
global vehicle heading ψ.

During the prediction, the state is propagated through a simplified
bicycle model disregarding any lateral forces and slip angles. The
current steering wheel angle is used to calculate the distance travelled
along the current curve, while the current longitudinal acceleration a_x
and yaw rate ψ̇ measured by the IMU are used to calculate the new vehicle
velocity and global vehicle heading. To continuously update the values,
the output of the local mapping and lidar perception as well as the
measurements of all four wheel speed sensors and the GPS are used.

To counter the disadvantage of the O(n³) complexity of the UKF
algorithm, with n being the number of state variables, the predicted
state variables are limited to the vehicle pose states and the updated
state variables are limited to the necessary ones: only the vehicle
pose if no lidar perception or local mapping output is available, and
otherwise the vehicle pose and the observed landmarks. As a result, the
complexity is nearly constant since the number of observed landmarks is
naturally limited.


4.4 Track filtering 


The track filtering module calculates the center point line of the track 
and the track width using the position and color of the cones. The 
general functionality of this module is split up into local and global 
filtering, based on the information passed on by the SLAM algorithm. 
The local track filtering follows three steps. First, it finds the
midpoints of the track using different approaches based on the number
and color of cones available from SLAM. If only one cone or only one
color is available, either the Dynamic Window or the Border Shift
approach is used. If several cones of each color are passed on, the
midpoints are calculated with the Delaunay triangulation. This variety
of approaches enhances the reliability of the module. The second
step is to interpolate and approximate the center line from the found 
midpoints. The third and last step is the definition of the legal track 
width for each point and the calculation of the left and right borderline. 
This information is then passed on to the motion planning module. 
The global track filtering works very similarly to its local counterpart,
except that it uses only the Delaunay triangulation for finding the
midpoints, since all global cone positions are known. They are sorted with
a tree algorithm and used to calculate the track width and border lines.


4.5 Motion planning 


The goal of the motion planning module is to generate a trajectory 
to enable dynamic racing maneuvers. Therefore, it is separated into 
two parts: local and global. The local one is used when no closed 
global track is passed on by the SLAM algorithm. It is also used while 
the global optimization is still calculating the optimal race line for the 
closed and global track. 
Local Motion Planning The local motion planning uses a directed
geometric graph-based approach fully written in Python and based on [9].
The current vehicle position is used as the origin. Normals to the
centerline of the track are calculated at regular intervals. The layers
of the graph are made up of nodes which are evenly spaced on the
normals. From each node, an edge exists to every node on the next
layer. To generate a curvature-optimized race line, a cost is calculated
for each edge. The cost takes into account the average and maximum of
the squared curvature of the edge as well as its length. Using the known
costs of all edges, the cheapest path can be found. This least-cost path
represents the most curvature-optimal path, for which a velocity profile
is then calculated. The velocity profile is based on the hypothesis
that the lateral velocity of the car at the apex point of a curve is
zero. Based on this hypothesis, the maximum accelerating and
decelerating velocity profiles are calculated from a ggv-map, which
delivers the maximal acceleration forces. These two profiles are then
superimposed.
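The search for the cheapest path through such a layered graph can be written as a simple dynamic program. The sketch below is a simplified illustration: the edge cost is only the Euclidean edge length, used as a stand-in for the curvature-based cost described above, and the node coordinates are made up.

```python
import math

def cheapest_path(layers, edge_cost):
    """Dynamic programming over a layered graph.

    layers is a list of node lists; every node of one layer is connected
    to every node of the next. Returns (total cost, node index per layer).
    """
    best = [0.0] * len(layers[0])
    back = []
    for prev, cur in zip(layers, layers[1:]):
        new_best, pointers = [], []
        for b in cur:
            costs = [best[i] + edge_cost(a, b) for i, a in enumerate(prev)]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min])
            pointers.append(i_min)
        best = new_best
        back.append(pointers)
    idx = min(range(len(best)), key=best.__getitem__)
    total = best[idx]
    path = [idx]
    for pointers in reversed(back):  # walk the back pointers to the origin
        idx = pointers[idx]
        path.append(idx)
    return total, path[::-1]

# Layer 0 is the vehicle position; the other layers lie on track normals.
layers = [
    [(0.0, 0.0)],
    [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)],
    [(2.0, -1.0), (2.0, 0.0), (2.0, 1.0)],
]
total, path = cheapest_path(layers, edge_cost=math.dist)
```

Because edges only connect consecutive layers, the runtime grows linearly with the number of layers and quadratically with the number of nodes per layer. A curvature-aware cost can be plugged in through the `edge_cost` argument.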


Global Motion Planning The global approach is also based on curva- 
ture optimization and inspired by [10]. For generating the global race 
line, the problem is set up as a quadratic programming problem. The 
global algorithm tries to minimize the sum of the curvature for a given 
reference line. In this specific use case it is the closed global center line 
that is passed by the SLAM algorithm and used as reference line. The
quadratic solver then tries to minimize the curvature by moving the
waypoints along their normal vectors. The output path is then
converted into a trajectory using the same velocity profile calculation as
the local approach. The global approach uses more computing power 
and takes more time to be calculated. Therefore, as mentioned above, 
the local optimization algorithm continues until a global trajectory is 
determined. 


4.6 Model predictive control 


The control module takes the trajectory from the planning module and
the vehicle states from SLAM as input to control the vehicle dynamics.
More precisely, the goal is to control the vehicle movement along
the planned path. This path tracking problem [11] aims to minimize
the deviation between the vehicle and the path points while ensuring
progress along the race track. A nonlinear model predictive controller
(NMPC) was developed to reach this goal. In general, an NMPC consists
of three key components: a nonlinear vehicle model, an optimization
objective and a reliable numerical solver. Based on the vehicle model,
the solver calculates the optimal control values in real time in order to
minimize a cost function, which serves as the optimization objective.
In contrast to classical control approaches, the NMPC is able
to predict and control the future behaviour of the vehicle states within
the prediction horizon. Hence, model predictive controllers are very
popular for autonomous vehicles.

The vehicle model is a nonlinear state-space model that describes the
vehicle dynamics. We use a kinematic bicycle model that neglects tire
forces, similar to [12] and to the one used inside the SLAM algorithm.
The model is implemented in Python as a system of continuous-time
differential equations with the vehicle acceleration $a_x$ and the tire
angle rate $\dot{\delta}$ as model inputs. To discretize the model, a
4th-order Runge-Kutta integrator is used. In every time step, the NMPC
calculates the optimal input vector for the optimization objective.
Depending on the sign of the input acceleration $a_x$, this value is
converted to either a pneumatic brake pressure or a motor torque value.
These control values are published to ROS and then transmitted via CAN
to the low-level control devices. In addition, they are filtered with an
IIR filter to suppress noise and outliers from the whole pipeline.

The optimization objective is described mathematically as a quadratic
cost function in which the squared differences between the predicted
vehicle positions and the reference positions are summed over the
prediction horizon. The reference positions for every time step along the
prediction horizon are derived by preprocessing the trajectory similar
to [12]. Path points and the velocity profile are used to calculate a time
profile, which is then used to extract the exactly timed reference
positions inside the prediction horizon.

To solve the optimization objective in real time, the FORCESPRO NLP
solver by Embotech is used [13]. This solver computes the input values
that minimize the cost function. With solving times below 5 milliseconds,
it is fast enough for reliable use inside the ACU. With a prediction
horizon of N = 20 and a time step of 50 ms, the NMPC is able to predict
and control the vehicle dynamics one second ahead of the current state
using the kinematic bicycle model.
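
The discretization step described above can be illustrated with a short sketch. It is not the authors' implementation: the state ordering (x, y, yaw, v, steering angle), the rear-axle form of the kinematic bicycle equations and the wheelbase value are assumptions made for illustration. Only the inputs (acceleration and steering angle rate), the 4th-order Runge-Kutta scheme, the 50 ms time step and the horizon N = 20 are taken from the text.

```python
import numpy as np

WHEELBASE_M = 1.5  # hypothetical wheelbase, not taken from the paper


def bicycle_dynamics(state, control):
    """Continuous-time kinematic bicycle model (assumed rear-axle form)."""
    x, y, yaw, v, steer = state
    accel, steer_rate = control
    return np.array([
        v * np.cos(yaw),                  # dx/dt
        v * np.sin(yaw),                  # dy/dt
        v / WHEELBASE_M * np.tan(steer),  # dyaw/dt
        accel,                            # dv/dt
        steer_rate,                       # dsteer/dt
    ])


def rk4_step(state, control, dt):
    """One 4th-order Runge-Kutta step with the control held constant."""
    k1 = bicycle_dynamics(state, control)
    k2 = bicycle_dynamics(state + 0.5 * dt * k1, control)
    k3 = bicycle_dynamics(state + 0.5 * dt * k2, control)
    k4 = bicycle_dynamics(state + dt * k3, control)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def rollout(state, controls, dt=0.05):
    """Predict the states over the horizon given one control per time step."""
    states = [np.asarray(state, dtype=float)]
    for control in controls:
        states.append(rk4_step(states[-1], np.asarray(control, dtype=float), dt))
    return np.array(states)


# Horizon of N = 20 steps at 50 ms, i.e. one second of prediction.
straight = rollout([0.0, 0.0, 0.0, 10.0, 0.0], np.zeros((20, 2)))
```

With zero inputs and an initial speed of 10 m/s, the rollout ends one second later about 10 m further along the x axis, which is a quick sanity check for the integrator.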


5 Results


Since the benefits and drawbacks of the single modules were already 
explained in Section 4, this section focuses more on the results and 
computation times of the whole pipeline. 


[Figure 2: bar chart over the modules Camera Wrapper, Inference, Local
Mapping, SLAM, Filtering, Motion Planning and Control; time axis from
0.000 s to 0.175 s.]

Figure 2: Median processing time of each module of the pipeline in seconds.


Figure 2 shows the median processing time of each module and, cumulatively,
of the whole signal processing pipeline. It takes about 175 ms
from the recording of an image until it is reflected in the control
output. The camera wrapper includes the recording and processing of
the image and its encoding into a ROS message. Computation-heavy
modules like the SLAM and motion planning modules have a major
share, which can be lowered by using a different parameter set,
migrating to a more efficient language like C++ and parallelizing specific
computations. The processing time of the control module is misleading,
since it is executed at a fixed rate of 20 Hz and thus the median
processing time includes idle time. Modules like inference, local
mapping and filtering offer little room for improvement, since most
of their calculations are carried out with efficient libraries like YOLOv5
or NumPy.

Assuming a vehicle velocity of 15 m/s, the total processing time leads
to a loss of 2.65 m of effective perception range.
Since the position-dependent modules (filtering, motion planning and
control) use a separate, more recent vehicle position, the control output
error due to an incorrectly assumed vehicle position is limited and can be
compensated by choosing a corresponding time step of the NMPC.
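
The relation behind the range loss is simply velocity times latency. The following sketch assumes a velocity of 15 m/s and the rounded latency of 175 ms. The helper name is hypothetical. With these rounded inputs the product is about 2.6 m, close to the 2.65 m quoted above, which presumably rests on the unrounded latency.

```python
def perception_range_loss(velocity_mps, latency_s):
    """Distance travelled while an image is still being processed."""
    return velocity_mps * latency_s


# Assumed inputs: 15 m/s vehicle velocity, about 175 ms pipeline latency.
loss_m = perception_range_loss(15.0, 0.175)
print(f"Effective perception range lost: {loss_m:.3f} m")
```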

Figure 3: Output of the pipeline visualized on a camera image.


Figure 3 shows the system output visualized on a camera image cap- 
tured on a testing day with a driver. The detected bounding boxes of 
the inference module as well as the calculated center points (green), the 
planned path (blue) and predicted path (purple) are shown. 


6 Conclusion and outlook 


In this work, the signal processing pipeline for an autonomous race car 
in the context of Formula Student competitions was presented. Each 
module was explained in detail, also focusing on its positive and nega- 
tive aspects regarding computational cost and reliability. The resulting 
output of the system was visualized and the computational times of 
each module were analyzed and put into context. 

To improve the system in the future, we plan to refine the calibration
method used for camera perception in order to enhance the accuracy
of the local maps derived from images. Furthermore, an investment in a
lidar with more than 16 channels and a Gaussian channel distribution is
planned, and work is under way to integrate it correctly into the system.
Finally, the computation times of the motion planning and SLAM modules
will be reduced by migrating them to C++.


Abstract Societies depend on the unrestricted availability of
their infrastructures. Events such as (natural) disasters,
emergencies, or even attacks could threaten their safety and security.
Indoor models provide relevant information that could help in
this regard. Floorplans contain key information such as the
location, design, and layout of a building. The architecture,
engineering, and construction (AEC) community works together to create
the respective indoor models within the Building Information
Modelling (BIM) framework. BIM modelling has recently attracted
attention in the computer vision domain. The 1st International
Scan-to-BIM Challenge, organised within the CVPR 2021 conference,
helped to establish research interest and common goals
between the AEC and computer vision communities. In this paper,
we introduce a method to estimate floorplans from 3D point
cloud data using the Scan-to-BIM dataset. Our work is based on
image processing techniques. It does not aim
to replace state-of-the-art approaches, which are more elaborate
and robust. Instead, it constitutes a non-CPU-intensive alternative
that fairly estimates floorplans for the Scan-to-BIM dataset.


Keywords Floorplan estimation, 3D point clouds, Scan-to-BIM, 
data and image processing 


1 Introduction


Modern societies depend on the unrestricted availability of their
critical infrastructures [1], and buildings are among the main terrestrial
infrastructures. They affect our quality of life in many of the same
ways as other infrastructures. Protecting them from danger is
essential for prosperity and social stability. Events such as (natural)
disasters, emergencies, or even attacks could threaten their safety and
security [2]. Therefore, it is important to gather detailed information
and to provide indoor models [3]. The Institutes for the Protection
of Terrestrial and Maritime Infrastructures, which belong to the German
Aerospace Center (DLR), are dedicated to developing concepts and
technologies that help to improve the safety and security of critical
maritime and terrestrial infrastructures.


The floorplan of a building is a relevant representation of
its interior. In the architecture, engineering, and construction
(AEC) community, such models are usually created manually, which
makes them prone to human error. Additionally, due to renovation and
maintenance, floorplans are often outdated. Moreover, in some
cases floorplans do not exist at all [4]. In recent years, with
3D point cloud scanning and technologies such as building information
modelling (BIM), such modelling has become common practice [3].
Although it still faces computational challenges such as data
diversity, accurate geometry and large-scale input [5], it is currently
an active area of research.


Computer vision has already made progress in the detection of walls
in buildings [6]. Deep learning has shown promising potential in
object detection [7] and in room layout reconstruction tasks such as
segmentation and geometry parsing [8-12]. Deep neural networks have
also been applied to floorplan reconstruction [13-15] (see Section 2).


In this paper, we propose an automatic and lightweight alternative for
estimating the 2D floorplan from 3D point cloud data by means of an
image processing approach. This work is structured as follows.
Section 2 reviews relevant literature. Section 3 describes the methodology
adopted in this work. The results are presented in Section 4 and
discussed in Section 5. The conclusion is presented in Section 6.


Indoor floorplan estimation for Scan-to-BIM


2 Relevant works 


Both computer vision and deep learning approaches have been used to
reconstruct indoor floorplans. In computer vision, representative
works are, for instance, [14] and [16]. The authors
generate the 2D floorplan by using line detection algorithms such as
Canny [17] and RANSAC [18]. In the latter case, the output model
is provided in the BIM format. It is important to note, however,
that, first, the reconstruction is based only on walls, i.e. it excludes
information such as doors and stairs. Second, the reconstruction
follows the Manhattan-layout assumption, i.e. the walls of the
floorplan can only be oriented horizontally or vertically.


In deep learning, Floor-Net [13] and FloorPP-Net [15] are representative
frameworks for reconstructing floorplans from 3D point clouds.
FloorPP-Net converts the Scan-to-BIM dataset [19] into point
pillars. The network then learns to predict the corners and the edges,
generating the desired floorplan model. Again, the final model
is based only on walls. Computer vision and deep learning approaches
are currently being extended to include doors
and stairs in future models. However, due to class imbalance (i.e.
data(wall) >> data(door) or data(stair)), this aim is relatively difficult
to accomplish. Besides, due to data pre-processing and the algorithms
involved (line detection for computer vision or neural networks
for deep learning), these frameworks can take up to several minutes
to compute (~5 minutes) and require a dedicated graphics processing unit
(GPU), which makes them computationally intensive.


3 Methodology 


3.1 Dataset 


In this paper, we introduce a method to estimate floorplans from 3D
point cloud data using the Scan-to-BIM dataset. The dataset was
obtained from the 1st International Scan-to-BIM Challenge [19]. It
was published at the workshop Computer Vision in the Built Environment
(CVBE) as part of the Computer Vision and Pattern Recognition (CVPR)
conference in 2021 [20]³. The dataset includes a wide variety of
buildings such as libraries, office labs, small, medium and large offices
as well as parking sites. It contains a total of 31 buildings with
multiple floors each and dozens of rooms on each floor. For 20 buildings
it also contains floorplan ground truths. The labels include
wall, door, stair, etc.


3.2 Framework 

The methodology developed in this work is based on image processing
techniques. The framework is implemented in two stages.


Algorithm 1 


The first stage constructs a 2D histogram in which all data points
are projected onto the x-y plane, with the bin size as
parameter (bins). The histogram returns the x and y edges of the grid
(i.e. x_edges and y_edges) as well as the number of data points per
two-dimensional bin (H), computed in log-scale.
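
The first stage can be sketched with NumPy's `histogram2d`. This is a minimal illustration, not the authors' code: the function name is hypothetical, the base-10 logarithm is an assumption (the text only says "log-scale"), and empty bins are set to 0 by convention here.

```python
import numpy as np


def project_to_histogram(points, bins=1000):
    """Project 3D points onto the x-y plane and count them per 2D bin.

    Returns (x_edges, y_edges, H), where H holds log10 counts. Empty bins
    are set to 0, so bins with exactly one point also map to 0 under this
    (assumed) convention.
    """
    points = np.asarray(points, dtype=float)
    counts, x_edges, y_edges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    H = np.zeros_like(counts)
    occupied = counts > 0
    H[occupied] = np.log10(counts[occupied])
    return x_edges, y_edges, H


# Tiny synthetic cloud: ten points in one corner, one point in the other.
cloud = np.array([[0.1, 0.1, 1.0]] * 10 + [[0.9, 0.9, 2.0]])
x_edges, y_edges, H = project_to_histogram(cloud, bins=2)
```

Dropping the z coordinate is what performs the projection. A wall then shows up as a line of bins with high counts, because many points at different heights fall into the same x-y bin.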


Algorithm 2 


The second stage computes the floorplan estimation in the following
way:
1. The outputs x_edges, y_edges, and H computed in the first stage
are taken as input.


2. The values are normalised with respect to the bin size to make
our method independent of the dimensions of the input point
cloud.


3. The ground truth, provided for 20 buildings by the Scan-to-BIM
dataset, supplies the label annotations as segments. A segment
is defined by two coordinates, i.e. (x1, y1) and (x2, y2). For
each segment, the class category (e.g. wall) is assigned to those bins
whose distance to the segment is smaller than a criterion (Crit)
and whose content, namely H, is larger than the threshold (Thr).


3 CVPR 2022 hosted the 2nd edition of the CVBE workshop, where the same
dataset was made available.


The proposed methodology has been applied to several cases. The
numeric values of the parameters are: bins = 1000, Crit = 25, and
Thr > 0. The parameters were selected by a grid search. They
optimise our results without incurring over-fitting.
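
The labelling rule of the second stage can be sketched as follows. This is an illustration under assumptions, not the published implementation: the function names are hypothetical, segment coordinates are assumed to be already expressed in bin-index units (i.e. after the normalisation step), and classes are encoded as small integers.

```python
import numpy as np


def point_segment_distance(px, py, x1, y1, x2, y2):
    """Euclidean distance from points (px, py) to the segment (x1, y1)-(x2, y2)."""
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return np.hypot(px - x1, py - y1)
    # Parameter of the orthogonal projection, clamped to the segment ends.
    t = np.clip(((px - x1) * dx + (py - y1) * dy) / length_sq, 0.0, 1.0)
    return np.hypot(px - (x1 + t * dx), py - (y1 + t * dy))


def label_bins(H, segments, crit=25.0, thr=0.0):
    """Assign a class id to every bin that is close to a segment and occupied.

    segments: iterable of (x1, y1, x2, y2, class_id) in bin-index units.
    Returns an integer array shaped like H, with 0 meaning 'unlabelled'.
    """
    labels = np.zeros(H.shape, dtype=int)
    ix, iy = np.indices(H.shape)
    for x1, y1, x2, y2, class_id in segments:
        dist = point_segment_distance(ix, iy, float(x1), float(y1), float(x2), float(y2))
        labels[(dist < crit) & (H > thr)] = class_id
    return labels


# Toy grid: one occupied row lies on a wall segment, one occupied bin does not.
H_toy = np.zeros((5, 5))
H_toy[2, :] = 1.0
H_toy[0, 0] = 1.0
labels = label_bins(H_toy, [(2, 0, 2, 4, 1)], crit=0.5)
```

Because a bin is only labelled when it lies near a ground-truth segment, a rule of this form cannot produce labels away from the ground truth, which is consistent with the zero false positives reported later in the paper.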


3.3 Metrics 


We aim to evaluate the position and length of the detected features
(e.g. wall) by using precision and recall. Based on true positives (TP),
false positives (FP), and false negatives (FN), we calculate precision
and recall as follows:


\[ \mathrm{Precision} = \frac{TP}{FP + TP} \qquad (1) \]
\[ \mathrm{Recall} = \frac{TP}{FN + TP} \qquad (2) \]


where: 


• TP refers to the area of the detected feature (e.g. wall) that is also
that feature (e.g. wall) in the ground truth.


• FP refers to the area of the detected feature that is not that feature
in the ground truth.


• FN is the area that is the feature in the ground truth but is not
detected as that feature by the proposed algorithm.
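
As a worked example of Eqs. (1) and (2), the following sketch evaluates both metrics for the wall counts reported in Section 4. The function name is hypothetical, and returning 0 for an empty denominator is a convention chosen here.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from area counts, following Eqs. (1) and (2)."""
    precision = tp / (fp + tp) if (fp + tp) > 0 else 0.0
    recall = tp / (fn + tp) if (fn + tp) > 0 else 0.0
    return precision, recall


# Wall counts as reported in Section 4 for the two test cases.
small = precision_recall(tp=11527, fp=0, fn=9723)
office = precision_recall(tp=16158, fp=0, fn=21853)
print(f"Small building: precision={small[0]:.2f}, recall={small[1]:.2f}")
print(f"Office Lab:     precision={office[0]:.2f}, recall={office[1]:.2f}")
```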


Finally, the Structural Similarity Index (SSIM) has also been calculated
following Eq. 13 of [21]. This image quality measure compares the
structural information of two images and ranges from 0
(no similarity) to 1 (identical). More details can be found in [21].




O.H. Ramirez-Agudelo et al. 


[Figure 1: three panels with axes X (m) and Y (m).]


Figure 1: Small building of the Scan-to-BIM Challenge (see Section 3.2). Left panel: Point
Cloud 12_SmallBuilding_02_F1. Middle panel: 2D histogram, outcome of
Algorithm 1. Right panel: Floorplan estimation, outcome of Algorithm 2
(see Sect. 3.2). Labels: walls (black), doors (purple), and stairs (gold).


4 Results 


The methodology presented in Sect. 3 has been applied to
two different cases. They belong to the training set of the Scan-to-BIM
dataset. Both have ground truth annotations with three categories:
wall, door and stair.


4.1 Small building 


Figure 1 presents the first experiment. The left panel shows the point cloud
of the first floor of a small building with about 17 million data points.
First, note that this point cloud does not follow the Manhattan
layout, i.e. the walls of the building are not oriented purely
horizontally or vertically [6]. Second, the data points contain no
information on the ceiling or floor. Third, the amount of clutter
and noise is minimal. This makes it an excellent case study.


The middle panel is the outcome of applying the steps described in
Algorithm 1 to the left panel. It shows the 2D histogram, where the
maximum value of H is about H_max = 3.5. The right panel has been
constructed by applying the steps described in Algorithm 2 (see
Sect. 3.2). There, the floorplan estimation of the Small building has been
obtained with label annotations. Doors and stairs are rather difficult
to retrieve.

Figure 2: Office Lab of the Scan-to-BIM Challenge (see Section 3.2). Left panel: Point Cloud
12_SmallBuilding_02_F1. Middle panel: 2D histogram, outcome of
Algorithm 1. Right panel: Floorplan estimation, outcome of Algorithm 2
(see Sect. 3.2). Labels: walls (black), doors (purple), and stairs (gold).


Using the definitions of precision and recall (see
Sect. 3.3), the feature wall yields a precision of 1 and a recall of 0.54
(TP = 11527, FN = 9723 and FP = 0).


4.2 Office Lab 


Figure 2 presents the second case. Panel a) shows the Office Lab with
about 120 million points. This case follows the Manhattan layout.
However, it contains information on the ceiling, which needs
to be removed. Therefore, this experiment constitutes a much more
complex case.


Following the work of [16], we first perform a height analysis
to remove the ceiling as well as the clutter (see Sections 3.1
and 3.2 of the mentioned paper). The point cloud is thereby reduced to
about six million points. Afterwards, we apply
our framework. Panel b) shows the 2D histogram. The maximum value
of H is about H_max = 2.5. Panel c) shows the floorplan estimation
accounting for the label categories. Once again, doors and stairs are
rather difficult to retrieve.


As for the metrics defined in Sect. 3.3, the feature wall yields a
precision of 1 and a recall of 0.43 (TP = 16158, FN = 21853 and FP = 0).


* Repository available in [22]. 


5 Discussion


5.1 Floorplan: Estimation vs. reconstruction 


Figures 3 and 4 compare the floorplan estimations for the Small building
and the Office Lab, presented in Figs. 1 and 2, with the ground truth.
Table 1 provides statistical insight into our findings. For the Small
building, the ratios of the number of points of the estimated feature
to the total number of points of that feature in the ground truth
are 54%, 40%, and 6% for wall, door and stair, respectively
(see values in Table 1). Note that the ground truth contains a room
around the coordinate (X, Y) = (578, 556) (see Fig. 3, right panel) that
is not present at all in the original point cloud data (see left panel
of Fig. 3). This inconsistency is intrinsic to the dataset. Although it
contributes to the discrepancy in our results, it does not explain the
difference altogether.


The ratios for the Office Lab are 43%, 26% and 4%, respectively. Due
to a class imbalance⁵, the identification of doors and stairs is limited.
This is a well-known issue in the literature, where it is common to
provide floorplans based purely on walls, e.g. [15,16] (see also Sect. 5.2).


In our approach, the FPs are zero for both cases (see Sect. 3). 
Thus, the ratio and recall have the same values. The SSIM for the Small 
building is 0.91 and for the Office Lab is 0.86 (see Table 1), indicating that 
the floorplan estimation and ground truth, at least for walls, are similar. 


5.2 Comparison to other methods 


State-of-the-art (SOTA) approaches (i.e. computer vision or deep
learning) make use of metrics such as Intersection over Union (IoU),
recall and precision. For instance, the computer vision work of [16]
presents great results in its experiments, with a precision and recall
of over 90%. The deep learning work FloorPP-Net [15], using
the Scan-to-BIM dataset, reported a precision of 7%, a recall of 39% and
an IoU of 12% for floorplans based only on walls (i.e. without
including information on any other feature such as door or stair).


5 data(wall) >> data(door) or data(stair).


Figure 3: Floorplan estimation (left panel) vs. the ground truth (right panel) for the Small
building presented in Sect. 4.1, where the walls (black), doors (purple) and stairs
(gold) are shown.


Figure 4: Floorplan estimation (left panel) vs. the ground truth (right panel) for the Office
Lab presented in Sect. 4.2, where the walls (black), doors (purple) and stairs
(gold) are shown.


Computer vision and deep learning are still improving, not only in
the automatic detection of walls but also in the detection of doors and
stairs. However, it is important to note that the calculation can take up
to several minutes and often requires a dedicated graphics
processing unit (GPU).

Table 1: Columns 1-4: Number (#) of data points per class category for the floorplan
estimation (our method) vs. ground truth, and their ratio
(# points estimation / # points ground truth). Column 5: Result of the
Structural Similarity Index (SSIM) between the estimation and ground truth.


                  Wall         Door         Stair        SSIM
                  (# points)   (# points)   (# points)   [0,1]
Small building
  Estimation      11527        724          4            0.91
  Ground truth    21250        1791         71
  Ratio           54%          40%          6%           -
Office Lab
  Estimation      16158        818          100          0.86
  Ground truth    38011        3181         2698
  Ratio           43%          26%          4%           -


Comparing this work to the SOTA, and considering that the metric
SSIM can be understood as a proxy for IoU, the results of this work
compare well (see also recall). Besides, it can be seen as an alternative
method for estimating the floorplans of buildings. It constitutes a
lightweight (i.e. CPU-based) implementation, which provides a fast and
fair floorplan estimation for the Scan-to-BIM dataset. By virtue of its
simplicity, its implementation will be extended to other datasets in the
future.


6 Conclusion 


Based on image processing techniques, we developed an alternative
method for estimating the floorplans of buildings in the Scan-to-BIM dataset.
Our method does not aim to replace state-of-the-art approaches, which
are more elaborate and robust. It does, however, provide a fair automatic
floorplan estimation, which may lead to the reconstruction of floorplans.


Abstract This work deals with the inference of corner points
of cardboard boxes that are arranged two-dimensionally in
a regular, dense packing pattern. Only 2D camera images and no
3D information are used as sensor data. The cardboard boxes are
viewed from extreme perspectives, as typically encountered when
"looking into the shelf" for automated picking tasks. Starting from
the four corners of an arbitrary cardboard box, a method based on
the cross-ratio is presented that can compute the corners of all
possible neighboring box arrangements. Furthermore, the error
propagation under the assumption of corner point measurements with
normally distributed noise is considered, and a parametric model for
the spatially varying 2D probability distributions of all derived
corner points is obtained from the error distribution.


Keywords Pattern recognition, robotics, perspective invariants


1 Introduction


The research project underlying this work deals with object recognition
for intralogistics applications. The problem addressed here arises from
a project with an industrial partner to develop a mobile pick-and-place
robot for the automated order picking of various types of goods. The
robot is to be used alongside human workers for picking mixed pallets,
so the environment can be instrumented only to a limited extent. This
poses several challenges, in particular for recognizing the goods. In
the picking area, lighting conditions change strongly due to sunlight
and shadowing. In addition, the visual properties of the objects can
change when the goods become soiled. Above all, however, the pallet
height and the resulting viewing angle onto the pallet mean that in some
cases only an extreme perspective is available for object recognition
(see also Figure 1). Since this is a frequent problem in object
recognition, various quantities that are invariant under perspective
distortion have already been studied in the literature [1]. One of the
perspective invariants studied is the cross-ratio, which has proven to
be a robust and accurate measure [2]. Furthermore, the principle of the
cross-ratio has been extended to obtain area invariants under projective
mappings [3].

To grasp the products, a grasp point must be determined. A
reconstruction of the packing pattern in the image of the pallet would
provide good candidates for the grasp points, e.g. the center of the
segmented cardboard box. To reconstruct the packing pattern, one would
like to use information that is typically available, e.g. the size and
geometry of the objects (pallet, cardboard box, etc.). One option for
the reconstruction is to generate a top view of the scene. An example is
shown in Figure 1, where a clear loss of resolution and large
interpolation artifacts are visible for the boxes in the rear area. This
can lead to errors in reconstructing the packing pattern of the pallet.
Moreover, the transformation into a top view requires the pose of the
camera relative to the surface of the packing pattern, or the
homography [4], to be reconstructed from the image data.


Figure 1: Left: original camera image. Right: loss of resolution caused by the
transformation into a top view.


Section 2 presents an approach that can be applied directly to the
image without prior transformation. Using the cross-ratio together with
the width and length of the boxes, we give a formula for computing
possible corner points of boxes in the packing pattern. Section 3 shows
the results of our method applied to an example image. It also examines
how measurement inaccuracies in extracting the corner coordinates of the
reference box affect the accuracy of the inferred box corners.
Section 4 summarizes the results and gives an outlook on future work.


2 Derivation of the corner point inference


The starting point is a top view of a single cardboard box used as
reference. The coordinates of its corner points are denoted by
$x_a^{00}, x_b^{00}, x_c^{00}, x_d^{00}$ (see footnote 1), where the
superscript numbers represent a local coordinate system for each
reference corner point. If the aspect ratio of the box is known, the ways
in which further boxes can be placed next to it can be computed. For an
aspect ratio of 2:1, Figure 2(a) shows the three ways in which a second
box could be arranged on the right side, and Figure 2(b) shows the same
for the upper side.

The points $x_i^{10}, x_i^{20}$ and $x_i^{01}, x_i^{02}$ are the possible
corner points of adjacent boxes in the x and y direction from the corner
point $x_i$. In the following they are called inferred first-order corner
points, since they can be computed directly with the cross-ratio. The
points $x^{11}, x^{12}$ and $x^{21}, x^{22}$ are called inferred
second-order corner points, since they can in turn be derived from the
first-order points.

This results in the configuration of points shown in Figure 3. In this
view, the next corner points can be computed directly if the length L,
the width B and the coordinates of the corner points $x_i^{00}$ of the
reference box are known. In a perspective view, however, the distances
between the corner points differ from box to box and additionally change
when the perspective changes. In this case, the cross-ratio can be used
to compute the corner points.


1 For the sake of clarity, the superscript numbers of the points
$x_a^{00}, x_b^{00}, x_c^{00}, x_d^{00}$ are omitted in some places.


Figure 2: (b) Arrangements of boxes with aspect ratio 2:1 placed one above the other.
Green points are the correct inferred first-order corner points.
Red points are first-order corner points from other possible
configurations. Yellow points are inferred second-order points.


Figure 3: Corner points for all configurations of boxes with aspect ratio
2:1. Blue points are part of the reference box. Green points are
inferred first-order points. Yellow points are inferred second-order
points.


Cross-ratios in packing patterns


The cross-ratio [A, B; C, D] of four points A, B, C, D lying on a
straight line is defined by

\[ [A, B; C, D] := \frac{\overline{AC} \cdot \overline{BD}}{\overline{BC} \cdot \overline{AD}} \qquad (1) \]

The cross-ratio is a projective invariant, i.e. it remains unchanged
under projective mappings [5]. To derive the next corner points
under a perspective view, the four possible cross-ratios
$r_1, r_2, r_3, r_4$ are first computed from the known configuration
(see Figure 4) with arbitrary L, B and $x_a, x_b, x_c, x_d$:

\[ r_1 = \frac{L \cdot (L/2 + B)}{L/2 \cdot (L + B)} = \frac{L + 2B}{L + B}, \qquad (2) \]
\[ r_2 = \frac{L \cdot (L/2 + L)}{L/2 \cdot (L + L)} = \frac{3}{2}, \qquad (3) \]
\[ r_3 = \frac{B \cdot (B/2 + B)}{B/2 \cdot (B + B)} = \frac{3}{2}, \qquad (4) \]
\[ r_4 = \frac{B \cdot (B/2 + L)}{B/2 \cdot (B + L)} = \frac{B + 2L}{B + L}. \qquad (5) \]

Interestingly, only three different values result: $r_1$ and $r_4$
depend on L and B, whereas $r_2 = r_3 = 3/2$ are identical and
independent of the aspect ratio of the box.


Perspektiveninvariante Inferenz von Eckpunkten
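
The cross-ratio values and their projective invariance can be checked numerically. The sketch below is an illustration only: the function names, the box dimensions and the homography matrix are arbitrary assumptions. The points are placed along one side of a box of length L and width B: the corner, the side midpoint, the next corner and the candidate neighbor corner.

```python
import numpy as np


def cross_ratio(a, b, c, d):
    """Cross-ratio [A, B; C, D] = (AC * BD) / (BC * AD) of four collinear points."""
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    ac = np.linalg.norm(c - a)
    bd = np.linalg.norm(d - b)
    bc = np.linalg.norm(c - b)
    ad = np.linalg.norm(d - a)
    return (ac * bd) / (bc * ad)


def apply_homography(h_matrix, point):
    """Map a 2D point with a 3x3 homography using homogeneous coordinates."""
    q = h_matrix @ np.array([point[0], point[1], 1.0])
    return q[:2] / q[2]


length, width = 0.4, 0.2  # assumed box dimensions (aspect ratio 2:1)

# Corner, side midpoint, next corner, neighbor corner at distance `width`.
line_points = [(0.0, 0.0), (length / 2, 0.0), (length, 0.0), (length + width, 0.0)]
r1_top_view = cross_ratio(*line_points)

# Arbitrary (assumed) perspective mapping that keeps the point order intact.
h_matrix = np.array([[1.0, 0.2, 3.0],
                     [0.1, 0.9, 1.0],
                     [0.3, 0.1, 1.0]])
r1_perspective = cross_ratio(*[apply_homography(h_matrix, p) for p in line_points])

# Neighbor corner at distance `length` instead of `width`.
r2_top_view = cross_ratio((0.0, 0.0), (length / 2, 0.0), (length, 0.0), (2 * length, 0.0))
```

The top-view value matches (L + 2B)/(L + B), the perspective value agrees with it up to rounding, and the second configuration gives 3/2 regardless of the chosen dimensions.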


Figure 4: With known length L and width B of the reference box, four
possible cross-ratios can be computed (see text).


Corner point computations


Next, the configuration is viewed from an arbitrary perspective,
as shown in Figure 5. Using homogeneous coordinates, the auxiliary
points $x_e$, $x_f$ and $x_g$ can be computed from Equations 6 to 8.
Here $\times$ denotes the three-dimensional cross product and
$\mathbf{x}_i$ denotes the coordinate vector of the point $x_i$. Each
line through the points $x_i$ and $x_j$ is parametrized by a vector
$\mathbf{l}_{ij}$:

\[ \mathbf{l}_{ac} = \mathbf{x}_a \times \mathbf{x}_c, \quad \mathbf{l}_{bd} = \mathbf{x}_b \times \mathbf{x}_d, \quad \mathbf{x}_e = \mathbf{l}_{ac} \times \mathbf{l}_{bd}, \qquad (6) \]
\[ \mathbf{l}_{ab} = \mathbf{x}_a \times \mathbf{x}_b, \quad \mathbf{l}_{cd} = \mathbf{x}_c \times \mathbf{x}_d, \quad \mathbf{x}_g = \mathbf{l}_{ab} \times \mathbf{l}_{cd}, \qquad (7) \]
\[ \mathbf{l}_{ad} = \mathbf{x}_a \times \mathbf{x}_d, \quad \mathbf{l}_{bc} = \mathbf{x}_b \times \mathbf{x}_c, \quad \mathbf{x}_f = \mathbf{l}_{ad} \times \mathbf{l}_{bc}. \qquad (8) \]

So that neighboring points can be inferred via the cross-ratio from the
four reference points alone, the four side midpoints
$x_{ab}, x_{cd}, x_{bc}, x_{da}$ are introduced as auxiliary points. The
side midpoints $x_{ij}$, $i \neq j \in \{a, b, c, d\}$, are constructed
by intersecting the line that passes through one of the vanishing points
$x_g, x_f$ and the center $x_e$ of the reference box with a side of the
quadrilateral:

\[ \mathbf{x}_{ab} = \mathbf{l}_{fe} \times \mathbf{l}_{ab}, \quad \mathbf{x}_{bc} = \mathbf{l}_{ge} \times \mathbf{l}_{bc}, \qquad (9) \]
\[ \mathbf{x}_{cd} = \mathbf{l}_{fe} \times \mathbf{l}_{cd}, \quad \mathbf{x}_{da} = \mathbf{l}_{ge} \times \mathbf{l}_{da}. \qquad (10) \]


Figure 5: Configuration for an arbitrary perspective.
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Mit den Seitenhalbierenden und dem Doppelverhältnis können wir 
dann den Abstand A zu einem Nachbarpunkt x}? berechnen: 


pe [xa — Xp] [lXab xp || [xa — xoll (([Xab — xoll +A) 
[xab — Xb || ||xa — G71] (|! Xab — xoll (Xa — Xb] +4)” 
a — Ar) (ab = x0 /Xa = xoll) (11) 
11||Xab — Xp] — [Xa — xo || 


Schließlich ergeben sich die Koordinaten des Punktes cad wie folgt: 


Xp — Xa 


10 
x = Xp +À ; 
i lxb — xall 


(12) 
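The chain from the side midpoint (Eq. 9) through the distance $\Delta$ (Eq. 11) to the neighbouring corner (Eq. 12) can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names, the homography `H` used to synthesize a perspective view, and the carton size of 200 x 100 units are all hypothetical and chosen only so that the result can be compared with a known ground truth.

```python
import math


def cross(p, q):
    """3-D cross product of two homogeneous 3-vectors."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def infer_neighbour(a, b, c, d, r):
    """Infer the corner beyond b on the line a-b from pixel corners a, b, c, d.

    r is the known cross-ratio of (a, side midpoint, b, neighbouring corner).
    """
    xa, xb, xc, xd = [(p[0], p[1], 1.0) for p in (a, b, c, d)]
    xe = cross(cross(xa, xc), cross(xb, xd))    # centre of the carton (Eq. 6)
    xf = cross(cross(xa, xd), cross(xb, xc))    # vanishing point of ad and bc (Eq. 8)
    xm = cross(cross(xf, xe), cross(xa, xb))    # side midpoint x_ab (Eq. 9)
    m = (xm[0] / xm[2], xm[1] / xm[2])
    dist_ab = math.dist(a, b)
    dist_mb = math.dist(m, b)
    delta = (1 - r) * dist_mb * dist_ab / (r * dist_mb - dist_ab)  # Eq. 11
    return (b[0] + delta * (b[0] - a[0]) / dist_ab,
            b[1] + delta * (b[1] - a[1]) / dist_ab)                # Eq. 12


def project(h, p):
    """Apply a 3x3 homography h (nested tuples) to a 2-D point p."""
    x, y, w = (row[0] * p[0] + row[1] * p[1] + row[2] for row in h)
    return (x / w, y / w)


# Hypothetical perspective view of a carton with L = 200 and B = 100.
H = ((1.0, 0.2, 10.0), (0.1, 1.0, 20.0), (0.001, 0.002, 1.0))
L, B = 200.0, 100.0
corners = [project(H, p) for p in ((0, 0), (L, 0), (L, B), (0, B))]
r1 = (L + 2 * B) / (L + B)

estimate = infer_neighbour(*corners, r1)
expected = project(H, (L + B, 0))  # true image of the neighbouring corner
print(estimate, expected)
```

The comparison with `expected` works because the cross-ratio is invariant under the projective mapping, so the value computed from the known carton dimensions also holds for the distances measured in the image.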


In the same way, all other first-order corner points can be computed. The
second-order corner point $x^{11}$ is obtained by

$$\mathbf{l}_{10,f} = \mathbf{x}^{10} \times \mathbf{x}_f, \quad \mathbf{l}_{01,g} = \mathbf{x}^{01} \times \mathbf{x}_g, \tag{13}$$

$$\mathbf{x}^{11} = \mathbf{l}_{10,f} \times \mathbf{l}_{01,g}. \tag{14}$$

By intersecting lines formed from the inferred first-order corner points and
the vanishing points, all other second-order corner points can be determined in
the same way.
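The intersection step for a second-order corner can be sketched as below. The helper names, the homography `H` and the choice of which two first-order points are combined are assumptions for illustration; the sketch only shows that intersecting the two lines through the vanishing points reproduces the projected grid corner.

```python
import math


def cross(p, q):
    """3-D cross product of two homogeneous 3-vectors."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def second_order_corner(x10, x01, xf, xg):
    """Intersect the line (x10, xf) with the line (x01, xg).

    All arguments are homogeneous 3-vectors; the result is homogeneous too.
    """
    return cross(cross(x10, xf), cross(x01, xg))


def project(h, p):
    """Apply a 3x3 homography h to a 2-D point p; returns a homogeneous point."""
    x, y, w = (row[0] * p[0] + row[1] * p[1] + row[2] for row in h)
    return (x / w, y / w, 1.0)


# Hypothetical perspective view of a 200 x 100 reference carton.
H = ((1.0, 0.2, 10.0), (0.1, 1.0, 20.0), (0.001, 0.002, 1.0))
xa, xb, xc, xd = (project(H, p) for p in ((0, 0), (200, 0), (200, 100), (0, 100)))
xf = cross(cross(xa, xd), cross(xb, xc))  # vanishing point of ad and bc
xg = cross(cross(xa, xb), cross(xc, xd))  # vanishing point of ab and cd

# Two first-order corners, here taken from the ground truth for brevity.
x10 = project(H, (300, 0))
x01 = project(H, (200, 200))

x11 = second_order_corner(x10, x01, xf, xg)
x11_pixel = (x11[0] / x11[2], x11[1] / x11[2])
expected = project(H, (300, 200))[:2]
print(x11_pixel, expected)
```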


3 Statistical analysis and parametric model


In Figure 6, the method presented in Section 2 was applied to an example image.
The corner points $x_a, x_b, x_c, x_d$ of a carton in the image were taken as
measured reference points. Then, for each corner point, 2000 noisy corner
points were drawn from a bivariate normal distribution
$\mathcal{N}(\mu, \Sigma)$ with mean vector $\mu = x_i$ and a covariance matrix
$\Sigma$ equal to the identity matrix. Subsequently, the first- and
second-order corner points were computed for all 2000 configurations.
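The sampling experiment can be mimicked with the standard library alone. The sketch below perturbs four hypothetical image corners with unit-variance Gaussian noise, infers one neighbouring corner per sample and summarizes the result by its sample mean and sample covariance. The corner coordinates, the cross-ratio value and all function names are assumptions; sample moments are used instead of the EM algorithm mentioned in the text, which for a single bivariate normal distribution gives the same maximum-likelihood estimate up to the normalization of the covariance.

```python
import math
import random


def cross(p, q):
    """3-D cross product of two homogeneous 3-vectors."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def infer_neighbour(corners, r):
    """Corner beyond b on the line a-b, from four pixel corners (a, b, c, d)."""
    xa, xb, xc, xd = [(x, y, 1.0) for x, y in corners]
    xe = cross(cross(xa, xc), cross(xb, xd))   # centre
    xf = cross(cross(xa, xd), cross(xb, xc))   # vanishing point of ad and bc
    xm = cross(cross(xf, xe), cross(xa, xb))   # side midpoint on ab
    m = (xm[0] / xm[2], xm[1] / xm[2])
    a, b = corners[0], corners[1]
    dist_ab, dist_mb = math.dist(a, b), math.dist(m, b)
    delta = (1 - r) * dist_mb * dist_ab / (r * dist_mb - dist_ab)
    return (b[0] + delta * (b[0] - a[0]) / dist_ab,
            b[1] + delta * (b[1] - a[1]) / dist_ab)


def simulate(corners, r, n_samples=2000, sigma=1.0, seed=0):
    """Return (points, mean, covariance) of the inferred corner under noise."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_samples):
        noisy = [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
                 for x, y in corners]
        points.append(infer_neighbour(noisy, r))
    n = len(points)
    mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    dx = [p[0] - mean[0] for p in points]
    dy = [p[1] - mean[1] for p in points]
    cxx = sum(v * v for v in dx) / (n - 1)
    cyy = sum(v * v for v in dy) / (n - 1)
    cxy = sum(u * v for u, v in zip(dx, dy)) / (n - 1)
    return points, mean, ((cxx, cxy), (cxy, cyy))


# Hypothetical measured corners (pixels) and cross-ratio for a 2:1 carton.
corners = [(100.0, 400.0), (300.0, 380.0), (310.0, 290.0), (120.0, 300.0)]
r1 = 4.0 / 3.0
points, mean, cov = simulate(corners, r1)
nominal = infer_neighbour(corners, r1)
print(nominal, mean, cov)
```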

The resulting distributions of the inferred corner points were then
approximated by a bivariate normal distribution using the
expectation-maximization (EM) algorithm [6] (see Figure 8). It is clearly
visible that the orientations of the normal distributions agree very well with
the orientation of the connecting lines between the centre of the reference
carton and the respective inferred point. Figure 7 shows the standard
deviations $\sigma_1, \sigma_2$ of the first and second principal components of
all bivariate normal distributions as a function of the distance between the
original inferred point and the centre of the carton $x_e$. For the first
principal component, the dependence can be approximated by a second-order
polynomial $p(x)$; for the second component, the relationship is described by a
straight line $\ell(x)$. From this, a model for the influence of normally
distributed noise on the inferred points can be derived. Let $x_j^{mn}$ be a
point inferred from the original configuration and $\mathbf{u}$ the vector from
$x_e$ to $x_j^{mn}$, i.e. $\mathbf{u} = \mathbf{x}_j^{mn} - \mathbf{x}_e$.
Next, define the vector $\mathbf{v} := (-u_2, u_1)$, so that $\mathbf{u}$ and
$\mathbf{v}$ are perpendicular to each other. This yields a parametric model
for the distribution of $x_j^{mn}$ as a function of $x_e$, namely a bivariate
normal distribution $\mathcal{N}(\mu, \Sigma)$ with $\mu = \mathbf{x}_j^{mn}$
and

$$\Sigma = \begin{pmatrix} \dfrac{\mathbf{u}}{\|\mathbf{u}\|} & \dfrac{\mathbf{v}}{\|\mathbf{v}\|} \end{pmatrix} \begin{pmatrix} p(\|\mathbf{u}\|)^2 & 0 \\ 0 & \ell(\|\mathbf{u}\|)^2 \end{pmatrix} \begin{pmatrix} \dfrac{\mathbf{u}}{\|\mathbf{u}\|} & \dfrac{\mathbf{v}}{\|\mathbf{v}\|} \end{pmatrix}^{\top}.$$

Figure 8 shows, by way of example for four inferred points, the distributions
together with the $2\sigma$ contours of the normal distribution computed by the
EM algorithm and of the normal distribution obtained from the model. Both the
orientation model and the linear model of the standard deviation along the
second principal component agree very well with the normally distributed
statistics of the data points. For the standard deviation of the first
principal component, deviations occur in some cases, but they are acceptable.
In future work, the data-driven model presented here should be verified by a
rigorous computation of the error propagation through Equation (12) under the
assumption of normally distributed coordinates of the reference corner points.

Figure 6: Possible configuration of cartons. The 2000 noisy reference corner
points are shown in blue. The resulting inferred first-order corner points are
shown in green and the inferred second-order corner points in orange.

Figure 7: Dependence of the standard deviations $\sigma_1, \sigma_2$ of the two
principal components on the distance $\|\mathbf{u}\|$ in pixels.

Figure 8: Distributions for four inferred corner points. The green ellipse is
the $2\sigma$ contour of the fitted normal distribution; the orange ellipse is
the $2\sigma$ contour of the normal distribution computed from the model. The
red and green vectors show the principal directions of the EM normal
distribution, the orange and cyan vectors those of the model normal
distribution. Panel axis ranges: (a) x and y from -15 to 10 pixels; (b) x and y
from -5 to 5 pixels; (c) x from -60 to 80 pixels, y from -80 to 60 pixels;
(d) x and y from -30 to 30 pixels.
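The parametric model described above can be illustrated with a short sketch that assembles a covariance matrix from the direction of the inferred point and two distance-dependent standard deviations. The coefficients of the polynomial `p` and of the line `l` below are invented placeholder values, not the fitted values behind Figure 7, and the function name `model_covariance` is likewise hypothetical.

```python
import math


def model_covariance(x_inferred, x_centre, p, l):
    """Covariance with principal axes along u = x_inferred - x_centre and v.

    p(d) and l(d) return the standard deviations along u and perpendicular
    to u for the distance d = ||u|| (in pixels).
    """
    u = (x_inferred[0] - x_centre[0], x_inferred[1] - x_centre[1])
    d = math.hypot(u[0], u[1])
    u_hat = (u[0] / d, u[1] / d)
    v_hat = (-u_hat[1], u_hat[0])          # u rotated by 90 degrees
    s1, s2 = p(d), l(d)
    # Sigma = s1^2 * u_hat u_hat^T + s2^2 * v_hat v_hat^T
    return [[s1 * s1 * u_hat[i] * u_hat[j] + s2 * s2 * v_hat[i] * v_hat[j]
             for j in range(2)] for i in range(2)]


def p(d):
    """Placeholder second-order polynomial for the first principal component."""
    return 0.5 + 1e-3 * d + 5e-6 * d * d


def l(d):
    """Placeholder straight line for the second principal component."""
    return 0.5 + 1e-3 * d


sigma = model_covariance((300.0, 400.0), (0.0, 0.0), p, l)
print(sigma)
```

Writing the matrix as a sum of two outer products is equivalent to rotating a diagonal matrix of variances into the coordinate frame spanned by the two unit vectors.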


4 Summary


It was shown that, in principle, all possible corner points of rectangular
elements that are assembled into a regular, densely packed, planar pattern can
be computed from four measured corner points of an arbitrary element of this
arrangement, using the cross-ratio and knowledge of the dimensions of the
rectangle. Advantages can arise in particular under extreme perspective,
because the position of the corner points can be computed accurately even
though the image resolution of the individual projected rectangles decreases
strongly with increasing distance from the camera. Inaccuracies in measuring
the corner coordinates of the reference rectangle propagate less strongly the
greater the distance and the lower the resolution in the projection.

This method can therefore be adapted not only to the pose detection of cartons
but also to other patterns consisting of rectangular elements, such as parquet
floors, masonry or checkerboard-like calibration patterns. The error
distribution can be approximated quite accurately by a spatially varying
two-dimensional normal distribution, where the location-dependent values of the
mean vector and the covariance matrix can be computed from the centre of the
reference rectangle and the coordinates of the corner point to be inferred.
This yields a complete probabilistic model that can be used, for example, as a
potential function in a Markov random field to determine the complete packing
pattern of cartons.


Abstract The refractive index is an essential property of an optical
component. It is often desirable to measure this material- and
process-dependent parameter in two dimensions with the highest possible spatial
resolution. This also applies to additively manufactured optics. Such
components can be realized, for example, by photopolymerization, in which UV
radiation cures a liquid polymer. Importantly, the degree of curing and the
associated refractive index depend on the irradiation properties. Thus, in
addition to its material dependence, the refractive index also depends on the
process. Conversely, a measurement of the two-dimensional refractive index
distribution can be used to infer the degree of curing of a sample and
subsequently to define process parameters.

This paper presents and discusses a method for the spatially and temporally
resolved measurement of the refractive index during and after UV curing of
polymers. It is based on a measurement approach that uses total internal
reflection. The measurement setup yields images that contain the information
on the local refractive index. These images can be evaluated either
analytically or with the help of a neural network.


Keywords Refractive index, optical metrology, additive manufacturing,
stereolithography


1 Introduction


In many fields and applications, a spatially resolved measurement of the
refractive index is desirable. This also applies to the additive manufacturing
of optical components, and in particular to the photopolymerization of liquid
resins by UV irradiation.

A corresponding experimental setup, which allows the curing of photosensitive
polymers to be studied, is shown in Figure 1. As can be seen in Figure 1a), the
light is generated in the form of a pixel mask (Intel DLP System) by a UV
projector (Wintech Pro4500, λ = 450 nm, 912 x 1140 pixels). A single pixel has
an edge length of 35 µm. The selected mask design is imaged onto the liquid
polymer via a deflection mirror and a microscope slide (substrate) in order to
cure it (see Figure 1b)). Only the polymer in the exposed regions cures,
although crosstalk from individual active pixels into neighbouring regions also
occurs. Because the final refractive index depends sensitively on the degree of
curing of the polymer, it is immediately evident that, depending on the mask,
locally different refractive indices can result. One possible mask design is
shown in Figure 1c), in which every second pixel column was activated. It can
also be seen that, for manufacturing reasons, each pixel has a "black dot" in
its centre, that is, a region in which it cannot send UV light onto the sample.
Owing to the resulting inhomogeneity of the irradiation, this also leads to an
inhomogeneity in the refractive index distribution.

Since the local UV irradiance determines the local degree of curing of the
polymer, which in turn correlates with the resulting refractive index, the
local properties of an additively manufactured optic can be manipulated via the
process parameters. To set the correct process parameters for a desired
refractive index distribution, the nonlinear relationship between the process
parameter (local irradiance) and the local refractive index must be established
beforehand. This requires a spatially and temporally resolved measurement of
the refractive index during irradiation, which is discussed here.


2 Experimental setup


Figure 1: a) Experimental setup for studying the UV curing of photopolymers;
b) schematic representation of the light path (labels: cured structure, liquid
polymer, substrate, projector); c) example of pixel-based illumination: pixel
columns were switched on and off alternately. A manufacturing-related "dead
zone" is visible in the centre of each pixel.

Figure 2: a) SFRIM: optical measurement of the refractive index based on total
internal reflection (labels: prism N-SF11, n = 1.78; XYZ scanning; camera
image; bright/dark transition = refractive index of the sample). b) Further
development, LineSFRIM: temporally and spatially resolved measurement of the
refractive index during UV polymerization.

The measurement setup realized for the spatially and temporally resolved
measurement of the refractive index is shown in Figure 2. It is a further
development ([1], [2]) of the scanning focused refractive index microscope
discussed in the literature [3]. The sample under investigation is placed on a
prism (Figure 2a), SFRIM method). This can be, for example, a liquid polymer
whose refractive index is to be determined, or a flat sample that is brought
into contact with the prism via an index-matching liquid. A laser beam is
focused onto the interface between sample and prism. Because of the focusing,
light rays from a certain range of angles strike the sample. Depending on the
condition for total internal reflection, these rays are either totally
reflected or coupled out of the prism. The critical angle at which total
internal reflection sets in is given by

$$\theta_c = \arcsin\left(\frac{n_{\mathrm{sample}}}{n_{\mathrm{prism}}}\right).$$

By determining the critical angle, the refractive index of the sample can
therefore be inferred. If the totally reflected light is recorded with a camera
chip, the angle of total reflection, and thus directly the local refractive
index of the sample (averaged over the area of the laser focal spot, here
approximately 2 µm in diameter), can be determined from the position of the
bright-dark edge. A quantitative evaluation requires a prior calibration of the
system (that is, of the bright-dark edge) with samples of known refractive
index. A range of commercial index-matching gels was used for this purpose.

In principle, the refractive index of the sample can thus be determined at the
focal point of the laser. To enable a spatially resolved measurement, the prism
is moved relative to the laser beam with an x-y translation stage. This
ultimately yields a two-dimensional measurement of the refractive index
distribution at the surface of the sample with a resolution of 2 µm. A
time-resolved measurement of the curing of the polymer is not possible in this
configuration.
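The relation between critical angle and refractive index can be made concrete with a small sketch. The prism index of 1.78 is taken from the figure labels; the sample index of 1.50 and the function names are assumptions chosen only for illustration.

```python
import math


def critical_angle(n_sample, n_prism):
    """Critical angle (radians) for total internal reflection at the interface."""
    if n_sample >= n_prism:
        raise ValueError("total internal reflection requires n_sample < n_prism")
    return math.asin(n_sample / n_prism)


def sample_index(theta_c, n_prism):
    """Invert the relation: refractive index of the sample from the critical angle."""
    return n_prism * math.sin(theta_c)


n_prism = 1.78                          # N-SF11 prism, value from the figure label
theta_c = critical_angle(1.50, n_prism)  # hypothetical sample with n = 1.50
print(math.degrees(theta_c), sample_index(theta_c, n_prism))
```

The guard in `critical_angle` reflects that the method can only measure samples whose refractive index is below that of the prism.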

For a time-resolved measurement of the refractive index distribution during
curing, the system was modified (see Figure 2b), LineSFRIM method). In this
case, a line-shaped focus instead of a point-shaped focus is applied to the
prism/sample interface. This makes it possible to measure the refractive index
distribution along the focal line with spatial resolution. In addition, the UV
projector mentioned above is integrated into the setup so that the polymer can
be cured while the change in refractive index along the focal line is observed
during the curing process. The temporal resolution of the system is determined
by the frame rate of the camera.


3 Results


Figure 3 compares results of the two methods, SFRIM and LineSFRIM. As can be
seen in the right-hand part of each image, an alternating pattern of 5
switched-on pixel rows (bright rows) and 5 switched-off pixel rows (dark
regions) was used as the projector mask for curing. Each active pixel leads to
local curing of the UV polymer at the prism interface and thus to a change in
the refractive index.

First, consider the point-wise measurement based on the SFRIM method, shown in
Figure 3a). The result image was produced by scanning the sample in the
y-direction (perpendicular to the activated pixel rows; yellow dots) over a
range of approximately 0.5 mm. Each horizontal row in the result image
corresponds to one measurement at a particular y-value on the sample. In each
measurement, the bright-dark transition is recorded with the camera and
displayed as a row of the result image. The x-axis of the result image
therefore corresponds to the refractive index. As can be seen in the figure,
along the dotted line the refractive index increases (the bright-dark edge
"moves to the right") wherever the sample is cured. In the unexposed regions,
the edge position, and hence the refractive index, remains unchanged.

In Figure 3b), the experiment is repeated, but the LineSFRIM method is used to
measure the refractive index distribution. Because the line focus (yellow line)
covers the entire region, this is a single measurement; that is, the result
image shown corresponds to a single exposure. The refractive index distribution
again shows the same pattern. However, a significant drawback of this 20 times
faster "one-shot" method is also apparent: the spatial resolution is reduced.
Nevertheless, it is sufficient for studying temporal processes during UV
irradiation.


Figure 3: a) SFRIM and b) LineSFRIM measurement of the refractive index
distribution after curing a polymer with a line pattern from a UV DMD
illumination.


Figure 4 shows a time-resolved measurement of the refractive index distribution
during curing using the LineSFRIM method. Figure 4a) shows the schematic
arrangement in side and top view. In this case, a UV pulse (violet) from a UV
LED above the polymer is applied to the polymer and causes it to cure. Along
the y-direction, the refractive index distribution on the sample is measured
with temporal resolution along the line focus (green). Figure 4b) shows the
image before exposure. A homogeneous refractive index is observed along the
line. When the UV pulse is applied, curing sets in and the refractive index
changes in this region. In the image, this appears as a local shift of the
bright/dark transition in the x-direction (Figure 4c), image after
approximately 0.5 s). The refractive index in the exposed region continues to
rise and reaches a maximum after approximately 1 second (Figure 4d)).
Subsequently, the cured region widens in the y-direction until finally about
half of the observed region in the y-direction is cured. In this way, the
kinetics of curing can be studied via the refractive index distribution.

Figure 4: a) SFRIM and b) LineSFRIM measurement of the refractive index
distribution after curing a polymer with a line pattern from a UV DMD
illumination.

After an appropriate calibration of the image, the position of the local
black/white transition can be converted into a plot of the refractive index
versus position, as shown in Figure 5. The figure shows the refractive index
profile along the y-direction before exposure (start) and after several
sequential exposures with the projector. The exposure mask was a sequence of 10
switched-on and 5 switched-off pixels, starting with 108 mJ/cm², followed by an
increased dose of 215 mJ/cm² and finally an irradiation with all pixels
switched on at 855 mJ/cm². For the lowest energy dose, there is still a clear
distinction between active and inactive pixels and, as a result, between cured
and uncured regions. At the higher dose, light spills over into regions that
are nominally not illuminated, so that curing also occurs there. It is also
apparent, however, that when the sample is fully illuminated with all pixels
activated, variations in the refractive index, and thus in the degree of
curing, remain in the regions of the originally inactive pixels.
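The conversion from edge position to refractive index after calibration could look roughly like the sketch below. It assumes a piecewise-linear interpolation between calibration samples; the source does not state which calibration function was used, and the calibration pairs and the function name `edge_to_index` are invented for illustration.

```python
from bisect import bisect_right


def edge_to_index(edge_px, calibration):
    """Piecewise-linear map from edge position (pixel) to refractive index.

    calibration: list of (edge_px, refractive_index) pairs sorted by edge_px,
    e.g. obtained from samples with known refractive index.
    """
    positions = [pos for pos, _ in calibration]
    if not positions[0] <= edge_px <= positions[-1]:
        raise ValueError("edge position outside the calibrated range")
    # index of the right-hand calibration point of the enclosing interval
    i = min(bisect_right(positions, edge_px), len(positions) - 1)
    (p0, n0), (p1, n1) = calibration[i - 1], calibration[i]
    return n0 + (n1 - n0) * (edge_px - p0) / (p1 - p0)


# Hypothetical calibration: edge pixel of three reference gels and their indices.
calibration = [(200.0, 1.46), (600.0, 1.50), (1000.0, 1.54)]

# Edge positions of four image rows, converted to a refractive index profile.
profile = [edge_to_index(px, calibration) for px in (200.0, 400.0, 800.0, 1000.0)]
print(profile)
```

Since the edge position depends on the sine of the critical angle, a linear interpolation is only an approximation between closely spaced calibration points.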

Two approaches were chosen for the evaluation itself: a classical evaluation
based on a Fresnel fit, and an evaluation by a neural network, which delivers
an identical result but performs the evaluation two orders of magnitude faster.
The neural network was built with the "Neural Net Fitting" tool of MATLAB. The
size of the input layer equals the number of pixels in an image row (1280). In
the network used, 50 neurons are connected in the hidden layers. The output
layer reduces the result to a single output value, so that each image row with
a grey-value profile over 1280 pixels is assigned a position of the bright-dark
edge, which corresponds to the refractive index. The training data needed to
train the network were generated synthetically from the theoretically known
profile of the Fresnel reflections.
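How such synthetic training rows might be generated is sketched below. Several points are assumptions not stated in the source: s-polarized light, a linear mapping of the pixel index to the angle of incidence between 50° and 65°, noise-free profiles, and the refractive index itself as training label. The function names are hypothetical, and the network training is not shown.

```python
import cmath
import math


def reflectance_s(theta, n_prism, n_sample):
    """Fresnel power reflectance (s-polarization) at the prism/sample interface."""
    cos_i = math.cos(theta)
    sin_t = n_prism / n_sample * math.sin(theta)   # Snell's law
    cos_t = cmath.sqrt(1.0 - sin_t * sin_t)        # imaginary beyond the critical angle
    r = (n_prism * cos_i - n_sample * cos_t) / (n_prism * cos_i + n_sample * cos_t)
    return abs(r) ** 2


def synthetic_row(n_sample, n_prism=1.78, n_pixels=1280,
                  theta_min=math.radians(50.0), theta_max=math.radians(65.0)):
    """Grey-value profile of one image row; pixel index maps linearly to angle."""
    step = (theta_max - theta_min) / (n_pixels - 1)
    return [reflectance_s(theta_min + i * step, n_prism, n_sample)
            for i in range(n_pixels)]


def training_set(indices):
    """Pairs (row, label) with the refractive index of the sample as label."""
    return [(synthetic_row(n), n) for n in indices]


# Eleven hypothetical sample indices from 1.45 to 1.55.
data = training_set([1.45 + 0.01 * k for k in range(11)])
row, label = data[5]
edge = next(i for i, value in enumerate(row) if value > 0.999999)
print(label, edge)
```

Beyond the critical angle the reflectance equals one, so the bright-dark edge in each row sits at the first pixel whose angle reaches the critical angle for the given sample index.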


4 Summary


In summary, an image-based measurement system for the spatially and temporally
resolved measurement of the refractive index distribution was realized. It can
be used to study the curing behaviour in the additive manufacturing of optical
components. The use of a neural network considerably accelerated the
evaluation, so that the system also enables an "online" evaluation of the
measurement result for time-resolved processes.

Figure 5: Temporal evolution of the refractive index distribution during UV
curing of a polymer. (Start: blue line; red and yellow lines: on/off mode of
individual pixel rows at 108 mJ/cm² and 215 mJ/cm²; violet line: all pixels of
the UV projector activated.)